(57) Abstract: A method and apparatus is described for identifying a subset of components of a system, the subset being capable of
predicting a feature of a test sample. The method comprises generating a linear combination of components and component weights 
in which values for each component are determined from data generated from a plurality of training samples, each training sample 
having a known feature. A model is defined for the probability distribution of a feature wherein the model is conditional on the
linear combination and wherein the model is not a combination of a binomial distribution for a two class response with a probit 
function linking the linear combination and the expectation of the response. A prior distribution is constructed for the component 
weights of the linear combination comprising a hyperprior having a high probability density close to zero, and the prior distribution 
and the model are combined to generate a posterior distribution. A subset of components is identified having component weights 
that maximise the posterior distribution. 
METHOD AND APPARATUS FOR IDENTIFYING DIAGNOSTIC COMPONENTS OF A SYSTEM

FIELD OF THE INVENTION 

The present invention relates to a method and apparatus 
for identifying components of a system from data generated
from samples from the system, which components are capable 
of predicting a feature of the sample within the system 
and, particularly, but not exclusively, the present 
invention relates to a method and apparatus for 
identifying components of a biological system from data
generated by a biological method, which components are 
capable of predicting a feature of interest associated 
with a sample from the biological system. 

BACKGROUND OF THE INVENTION

There are any number of "systems" in existence which can
be classified into different features of interest. The
term "system" essentially includes all types of systems
for which data can be provided, including chemical
systems, financial systems (e.g. credit systems for
individuals, groups or organisations, loan histories),
geological systems, and many more. It is desirable to be
able to utilise data generated from the systems (e.g.
statistical data) to identify particular features of
samples from the system (e.g. to assist with analysis of a
financial system to identify the groups which exist in the
financial system, in very simple terms those who have
"good" credit and those who are a credit risk). Where
there is a large amount of statistical data, the
identification of components from that data which are
predictive of a particular feature of a sample from the
system is a difficult task, generally because there is a
large amount of data to process, the majority of which may
provide little or no indication of the features of
interest of a particular sample from which the data is
taken. In addition, components that are identified using
training sample data are often ineffective at identifying
features on test sample data when the test sample data
has a high degree of variability relative to the training
sample data. This is often the case in situations when,
for example, data is obtained from many different sources,
as it is often impossible to control the conditions under
which the data is collected from each individual source.

An example of a type of system where these problems are
particularly pertinent is a biological system, and the
following description refers specifically to biological
systems. The present invention is not limited to use
with biological systems, however, and it has general
application to any system.

Recent advances in biotechnology have resulted in the 
development of biological methods for large scale 
screening of systems and analysis of samples. Such 
methods include, for example, DNA, RNA or antibody 
microarray analysis, proteomics analysis, proteomics 
electrophoresis gel analysis and high throughput 
screening techniques. These types of methods often 
result in the generation of data that can have up to 
30,000 or more components for each sample that is tested. 

It is obviously important to be able to identify features
of interest in samples from biological systems, for
example, to classify groups such as "diseased" and
"non-diseased". Many of these biological methods would be
useful as diagnostic tools predicting features of a
sample in the biological systems (e.g. for identifying
diseases by screening tissues or body fluids, or as tools
for determining, for example, the efficacy of
pharmaceutical compounds).

Use of biological methods such as biotechnology arrays in
such applications to date has been limited owing to the
large amount of data that is generated from these types
of methods, and the lack of efficient methods for
screening the data for meaningful results. Consequently,
analysis of biological data using prior art methods
either fails to make full use of the information in the
data, or is time consuming, prone to false positive and
negative results and requires large amounts of computer
memory if a meaningful result is to be obtained from the
data. This is problematic in large scale screening
scenarios where rapid and accurate screening is required.

There is therefore a need for an improved method, in 
particular for analysis of biological data, and, more 
generally, for an improved method of analysing data from 
any system in order to predict a feature of interest for 
a sample from the system.

SUMMARY OF THE INVENTION 

In a first aspect, the invention provides a method for
identifying a subset of components of a system, the
subset being capable of predicting a feature of a test
sample, the method comprising the steps of:

(a) generating a linear combination of components and
component weights in which values for each component
are determined from data generated from a plurality
of training samples, each training sample having a
known feature;

(b) defining a model for the probability distribution of
a feature wherein the model is conditional on the
linear combination and wherein the model is not a
combination of a binomial distribution for a two
class response with a probit function linking the
linear combination and the expectation of the
response;

(c) constructing a prior distribution for the component
weights of the linear combination comprising a
hyperprior having a high probability density close
to zero;

(d) combining the prior distribution and the model to
generate a posterior distribution;

(e) identifying a subset of components having component
weights that maximise the posterior distribution.

The method utilises training samples having a known 
feature in order to identify a subset of components which 
can predict a feature for a training sample. 
Subsequently, knowledge of the subset of components can 
be used for tests, for example clinical tests, to predict 
a feature such as whether a tissue sample is malignant or
benign, or what is the weight of a tumour, or provide an
estimated time for survival of a patient having a
particular condition. As used herein, the term "feature"
refers to any response or identifiable trait or character
that is associated with a sample. For example, a feature
may be a particular time to an event for a particular
sample, or the size or quantity of a sample, or the class
or group into which a sample can be classified.

The method of the present invention estimates the
component weights utilising a Bayesian statistical
method. Where there are a large number of
components generated from the system (which will usually
be the case for the method of the present invention to be
effective), the method preferably makes an a priori
assumption that the majority of the components are
unlikely to be components that will form part of the
subset of components for predicting a feature. The
assumption is therefore made that the majority of
component weights are likely to be zero. A model is
constructed which, with this assumption in mind, sets the
component weights so that the posterior probability of
the weights is maximised. Components having a weight
below a pre-determined threshold (which will be the
majority of them in accordance with the a priori
assumption) are dispensed with. The process is iterated
until the remaining diagnostic components are identified.
This method is quick, mainly because of the a priori
assumption which results in rapid elimination of the
majority of components.

Most features of a system typically exhibit a probability 
distribution, and the probability distribution of a
feature can be modelled using statistical models which 
are based on the data generated from the training 
samples. The method of the invention utilises 
statistical models which model the probability 
distribution for a feature of interest or a series of 
features of interest. Thus, for a feature of interest 
having a particular probability distribution, an 
appropriate model is defined that models that
distribution. The method may use any model that is 
conditional on the linear combination, and is preferably 
a mathematical equation in the form of a likelihood 
function that provides a probability distribution based 
on the data obtained from the training samples.
Preferably, the likelihood function is based on a 
previously described model for describing some 
probability distribution. In one embodiment, the model is 
a likelihood function based on a model selected from the 
group consisting of a multinomial or binomial logistic 
regression, generalised linear model, Cox's proportional 
hazards model, accelerated failure model, parametric 
survival model, a chi-squared distribution model or an 
exponential distribution model. 

In one embodiment, the likelihood function is based on a 
multinomial or binomial logistic regression. The 
binomial or multinomial logistic regression preferably 
models a feature having a multinomial or binomial 
distribution. A binomial distribution is a statistical 
distribution having two possible classes or groups such 
as an on/off state. Examples of such groups include
dead/alive, improved/not improved, depressed/not
depressed. A multinomial distribution is a
generalisation of the binomial distribution in which a 
plurality of classes or groups are possible for each of a 
plurality of samples, or in other words, a sample may be 
classified into one of a plurality of classes or groups. 
Thus, by defining a likelihood function based on a
multinomial or binomial logistic regression, it is 
possible to identify subsets of components that are 
capable of classifying a sample into one of a plurality 
of pre-defined groups or classes. To do this, training 
samples are grouped into a plurality of sample groups (or 
"classes") based on a predetermined feature of the 
training samples in which the members of each sample 
group have a common feature and are assigned a common 
group identifier. A likelihood function is formulated 
based on a multinomial or binomial logistic regression 
conditional on the linear combination (which incorporates 
the data generated from the grouped training samples).
The feature may be any desired classification by which
the training samples are to be grouped. For example, the
features for classifying tissue samples may be that the
tissue is normal, malignant or benign; the feature for
classifying cell samples may be that the cell is a 
leukemia cell or a healthy cell, that the training 
samples are obtained from the blood of patients having or 
not having a certain condition, or that the training 
samples are from a cell from one of several types of 
cancer as compared to a normal cell.

Preferably, the likelihood function based on the logistic 
regression is of the form: 

$$L = \prod_{i=1}^{n} \left[ \prod_{g=1}^{G-1} \left( \frac{e^{x_i^T\beta_g}}{1+\sum_{h=1}^{G-1} e^{x_i^T\beta_h}} \right)^{e_{ig}} \right] \left( \frac{1}{1+\sum_{h=1}^{G-1} e^{x_i^T\beta_h}} \right)^{e_{iG}}$$

wherein

$x_i^T \beta_g$ is a linear combination generated from input data
from training sample $i$ with component weights $\beta_g$;

$x_i^T$ is the components for the $i$th row of $X$ and $\beta_g$ is a set
of component weights for sample class $g$;

$e_{ig} = 1$ if training sample $i$ is a member of class $g$, $e_{ig} = 0$
otherwise;

and

$X$ is data from $n$ training samples comprising $p$
components.
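
As an illustration only, the log of a multinomial logistic likelihood of this kind can be evaluated as in the following Python sketch. It assumes that the last class is taken as a reference class whose weights are fixed at zero, so that only G-1 weight vectors are estimated. The function names `class_probabilities` and `log_likelihood` and the toy data are hypothetical and are not taken from the specification.

```python
import math

def class_probabilities(x, betas):
    """Return P(class g | x) for g = 1..G from the G-1 weight vectors in betas.

    Assumption: the last class is the reference class (linear combination 0).
    """
    scores = [sum(xj * bj for xj, bj in zip(x, beta)) for beta in betas]
    scores.append(0.0)                      # reference class G
    top = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def log_likelihood(X, classes, betas):
    """Sum over samples of log P(observed class); classes are 0-based indices."""
    return sum(math.log(class_probabilities(x, betas)[g])
               for x, g in zip(X, classes))

# Toy data: n = 3 training samples, p = 2 components, G = 3 classes.
X = [[1.0, 2.0], [0.5, -1.0], [0.0, 0.3]]
classes = [0, 2, 1]
zero_weights = [[0.0, 0.0], [0.0, 0.0]]
print(log_likelihood(X, classes, zero_weights))  # equals 3 * log(1/3)
```

With all weights at zero every class is equally likely, which gives a convenient sanity check before any weights are estimated.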

In another embodiment, the likelihood function is based 
on an ordered categorical logistic regression. The 
ordered categorical logistic regression models a 
multinomial distribution in which the classes are in a 
particular order (ordered classes such as, for example,
classes of increasing or decreasing disease severity).
By defining a likelihood function based on an ordered
categorical logistic regression, it is possible to
identify a subset of components that is capable of
classifying a sample into a class wherein the class is
one of a plurality of predefined ordered classes. By
defining a series of group identifiers in which each
group identifier corresponds to a member of an ordered
class, and grouping the training samples into one of the
ordered classes based on predetermined features of the
training samples, a likelihood function can be formulated
based on a categorical ordered logistic regression which
is conditional on the linear combination (which
incorporates the data generated from the grouped training
samples).

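Preferably, the likelihood function based on the
ordered categorical logistic regression is of the form:

$$L = \prod_{i=1}^{N} \prod_{k=1}^{G-1} \left( \frac{\gamma_{ik}}{\gamma_{ik+1}} \right)^{r_{ik}} \left( \frac{\gamma_{ik+1}-\gamma_{ik}}{\gamma_{ik+1}} \right)^{r_{ik+1}-r_{ik}}$$

$$\operatorname{logit}\left( \frac{\gamma_{ik+1}-\gamma_{ik}}{\gamma_{ik+1}} \right) = \operatorname{logit}(\ldots)$$

wherein

$\gamma_{ik}$ is the probability that training sample $i$ belongs to a
class with identifier less than or equal to $k$ (where the
total of ordered classes is $G$);

$x_i^T \beta^*$ is a linear combination generated from input data
from training sample $i$ with component weights $\beta^*$;

$x_i^T$ is the components for the $i$th row of $X$;

$r_{ij}$ is defined as:

$$r_{ij} = \sum_{g=1}^{j} c_{ig}$$

where

$$c_{ij} = \begin{cases} 1 & \text{if observation } i \text{ is in class } j \\ 0 & \text{otherwise} \end{cases}$$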
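
The role of the cumulative probabilities and cumulative class indicators can be checked numerically. The Python sketch below assumes the likelihood is written as a product over k of (gamma_k / gamma_{k+1})^{r_k} ((gamma_{k+1} - gamma_k) / gamma_{k+1})^{r_{k+1} - r_k}, which is one standard way of factorising a multinomial likelihood for ordered classes, and confirms that for a single sample the product collapses to the probability of the observed class. The function name `ordered_likelihood` and the numbers are hypothetical.

```python
def ordered_likelihood(gamma, cls):
    """Likelihood contribution of one sample in the cumulative-probability form.

    gamma: cumulative probabilities, gamma[k] = P(class index <= k), gamma[-1] == 1.
    cls:   0-based index of the observed class.
    """
    G = len(gamma)
    # r[k] is the cumulative indicator: 1 once the observed class is at or below k.
    r = [1 if cls <= k else 0 for k in range(G)]
    value = 1.0
    for k in range(G - 1):
        value *= (gamma[k] / gamma[k + 1]) ** r[k]
        value *= ((gamma[k + 1] - gamma[k]) / gamma[k + 1]) ** (r[k + 1] - r[k])
    return value

gamma = [0.2, 0.7, 1.0]   # hypothetical cumulative probabilities for G = 3 classes
pi = [0.2, 0.5, 0.3]      # the corresponding individual class probabilities
for cls in range(3):
    print(cls, ordered_likelihood(gamma, cls), pi[cls])
```

Each printed pair agrees up to rounding, which shows that the factorised form carries the same information as the plain multinomial likelihood while exposing the ordering of the classes.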



In another embodiment of the present invention, the 
likelihood function is based on a generalised linear 
model. The generalised linear model preferably models a 
feature which has a distribution belonging to the regular 
exponential family of distributions. Examples of
regular exponential family distributions include normal 
distribution, Gaussian distribution, Poisson 
distribution, gamma distribution and inverse gamma
distribution. Thus, in another embodiment of the method
of the invention, a subset of components is identified
that is capable of predicting a predefined characteristic
of a sample that lies within a regular exponential family 
of distributions by defining a generalised linear model 
which models the characteristic to be predicted. 
Examples of a characteristic that may be predicted using 
a generalised linear model include any quantity of a 
sample that exhibits the specified distribution such as, 
for example, the weight, size, counts, group membership 
or other dimensions or quantities or properties of a 
sample. 

Preferably, the generalised linear model is of the form:

$$\log p(y \mid \beta, \varphi) = \sum_{i=1}^{n} \left\{ \frac{y_i\theta_i - b(\theta_i)}{a_i(\varphi)} + c(y_i, \varphi) \right\}$$

wherein

$y = (y_1, \ldots, y_n)^T$, and $y_i$ is the characteristic measured
on the $i$th sample;

$a_i(\varphi) = \varphi / w_i$ with the $w_i$ being a fixed set of known
weights and $\varphi$ a single scale parameter;

the functions $b(\cdot)$ and $c(\cdot)$ are preferably as defined by
Nelder and Wedderburn (1972).

Preferably,

$$E\{y_i\} = b'(\theta_i)$$

$$\operatorname{Var}\{y_i\} = b''(\theta_i)\, a_i(\varphi) = \tau_i^2\, a_i(\varphi).$$

Preferably, each observation has a set of covariates $x_i$
and a linear predictor $\eta_i = x_i^T \beta$. The relationship
between the mean of the $i$th observation and its linear
predictor is preferably given by the link function
$\eta_i = g(\mu_i) = g(b'(\theta_i))$. The inverse of the link is denoted by
$h$, which is preferably:

$$E\{y_i\} = b'(\theta_i) = h(\eta_i).$$
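
By way of a hedged example, the Poisson distribution can be written in this exponential family form. The Python sketch below assumes theta = log(mu), b(theta) = exp(theta), a(phi) = 1, c(y) = -log(y!) and a log link, so that the inverse link is h(eta) = exp(eta). These choices, the function names and the numbers are illustrative assumptions rather than part of the specification.

```python
import math

def glm_log_density(y, theta, b, a_phi, c):
    """log p(y) = (y * theta - b(theta)) / a(phi) + c(y) for one observation."""
    return (y * theta - b(theta)) / a_phi + c(y)

def poisson_log_density(y, mu):
    """Poisson case: theta = log(mu), b = exp, a(phi) = 1, c(y) = -log(y!)."""
    return glm_log_density(y, math.log(mu), math.exp, 1.0,
                           lambda v: -math.lgamma(v + 1))

def mean_from_linear_predictor(x, beta):
    """Inverse link h(eta) = exp(eta) applied to eta = x^T beta (log link)."""
    eta = sum(xj * bj for xj, bj in zip(x, beta))
    return math.exp(eta)

# One observation with two covariates and hypothetical weights.
mu = mean_from_linear_predictor([1.0, 2.0], [0.1, 0.3])
direct = -mu + 4 * math.log(mu) - math.log(math.factorial(4))
print(poisson_log_density(4, mu), direct)
```

The two printed values agree up to rounding, since b'(theta) = exp(theta) = mu recovers the Poisson mean.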
In another embodiment, the method of the present
invention may be used to predict the time to an event for
a sample by utilising a likelihood function based on a
hazard model which preferably estimates the probability
of a time to an event given that the event has not taken
place at the time of obtaining the data. In one
embodiment, the likelihood function is based on a model
selected from the group consisting of Cox's proportional
hazards model, parametric survival model and accelerated
failure times model. Cox's proportional hazards model
permits the time to an event to be modelled on a set of
components and component weights without making
restrictive assumptions about the form of the hazard
function. The accelerated failure model is a general
model for data consisting of survival times in which the
component measurements are assumed to act
multiplicatively on the time-scale, and so affect the
rate at which an individual proceeds along the time axis.
Thus, the accelerated survival model can be interpreted
in terms of the speed of progression of, for example,
disease. The parametric survival model is one in which
the distribution function for the time to an event (e.g.
survival time) is modelled by a known distribution or has
a specified parametric formulation. Among the commonly
used survival distributions are the Weibull, exponential
and extreme value distributions.

Preferably, a subset of components capable of predicting 
the time to an event for a sample is identified by
defining a likelihood based on Cox's proportional hazards 
model, a parametric survival model or an accelerated 
survival times model, which comprises measuring the time 
elapsed for a plurality of samples from the time the 
sample is obtained to the time of the event. 

Preferably, the likelihood function for predicting the
time to an event is of the form:

$$\text{Log (Partial) Likelihood} = \sum_{i=1}^{N} g_i(\beta, \varphi; X, y, c)$$

where $\beta^T = (\beta_1, \beta_2, \ldots, \beta_p)$ and $\varphi^T = (\varphi_1, \varphi_2, \ldots, \varphi_q)$ are the model
parameters.

Preferably, the likelihood function based on Cox's
proportional hazards model is of the form:

$$L(t \mid \beta) = \prod_{j=1}^{N} \left( \frac{\exp(Z_j\beta)}{\sum_{i \in \Re_j} \exp(Z_i\beta)} \right)^{d_j}$$

where $Z$ is preferably a matrix that is the rearrangement
of the rows of $X$ where the ordering of the rows of $Z$
corresponds to the ordering induced by the ordering of
the survival times, and $d$ is the result of ordering the
censoring index with the same permutation required to
order the survival times. Also $Z_j$ is the $j$th row of the matrix
$Z$ and $d_j$ is the $j$th element of $d$, and where
$\beta^T = (\beta_1, \beta_2, \ldots, \beta_p)$ and $\Re_j = \{i : i = j, j+1, \ldots, N\}$ is the risk set at
the $j$th ordered event time $t_{(j)}$.
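
The ordering of the rows and the risk sets can be made concrete with a short Python sketch of the log of a Cox partial likelihood. It assumes there are no tied event times and that the censoring index is 1 for an observed event and 0 for a censored sample. The function name `cox_log_partial_likelihood` and the toy inputs are hypothetical.

```python
import math

def cox_log_partial_likelihood(times, events, X, beta):
    """Log partial likelihood, assuming no tied event times.

    times:  survival or censoring time for each sample
    events: censoring index, 1 if the event was observed and 0 if censored
    X:      one row of component values per sample
    """
    order = sorted(range(len(times)), key=lambda i: times[i])
    Z = [X[i] for i in order]            # rows of X in order of survival time
    d = [events[i] for i in order]       # censoring index, same permutation
    eta = [sum(zj * bj for zj, bj in zip(z, beta)) for z in Z]
    total = 0.0
    for j in range(len(Z)):
        if d[j]:
            # Risk set at the j-th ordered time: samples j, j+1, ..., N.
            risk = sum(math.exp(e) for e in eta[j:])
            total += eta[j] - math.log(risk)
    return total

# Three samples, one component, all weights zero: every sample in a
# risk set is equally likely to fail next.
print(cox_log_partial_likelihood([5.0, 2.0, 9.0], [1, 1, 1],
                                 [[1.0], [0.0], [2.0]], [0.0]))  # equals -log(6)
```

Censored samples contribute no term of their own but remain in the risk sets of earlier event times, which is why the censoring index is permuted together with the rows of X.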

Preferably, the log likelihood function based on the
parametric survival model is of the form:

$$L = \sum_{i=1}^{N} \left\{ c_i \log(\mu_i) - \mu_i + c_i \log\left( \frac{\lambda(y_i)}{\Lambda(y_i; \varphi)} \right) \right\}$$

where

$$\mu_i = \Lambda(y_i; \varphi) \exp(X_i \beta)$$

and $c_i = 1$ if the $i$th sample is uncensored and $c_i = 0$ if the $i$th
sample is censored.

This form of the likelihood function is shared by the
Weibull, exponential and extreme value distributions. The
functions $\lambda(\cdot)$ and $\Lambda(\cdot)$ are as defined by Aitkin and
Clayton (1980).
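
As a sketch under stated assumptions, the Weibull case can be used to check a log likelihood of the form sum(c_i log(mu_i) - mu_i + c_i log(lambda(y_i) / Lambda(y_i))). The Python code below assumes Lambda(y) = y^alpha and lambda(y) = alpha y^(alpha - 1), with mu_i = Lambda(y_i) exp(x_i beta), and compares the result with the same log likelihood written directly in terms of the hazard and cumulative hazard. The function names, the shape parameter `alpha` and the data are hypothetical.

```python
import math

def weibull_loglik_mu_form(y, c, X, beta, alpha):
    """sum(c*log(mu) - mu + c*log(lambda/Lambda)) for an assumed Weibull model."""
    total = 0.0
    for yi, ci, xi in zip(y, c, X):
        eta = sum(xj * bj for xj, bj in zip(xi, beta))
        Lam = yi ** alpha                     # Lambda(y) = y^alpha
        lam = alpha * yi ** (alpha - 1)       # lambda(y) = dLambda/dy
        mu = Lam * math.exp(eta)
        total += ci * math.log(mu) - mu + ci * math.log(lam / Lam)
    return total

def weibull_loglik_direct(y, c, X, beta, alpha):
    """Same quantity written as sum(c*log(hazard) - cumulative hazard)."""
    total = 0.0
    for yi, ci, xi in zip(y, c, X):
        eta = sum(xj * bj for xj, bj in zip(xi, beta))
        hazard = alpha * yi ** (alpha - 1) * math.exp(eta)
        total += ci * math.log(hazard) - yi ** alpha * math.exp(eta)
    return total

y = [1.5, 0.7, 3.2]                 # observed times
c = [1, 0, 1]                       # 1 = uncensored, 0 = censored
X = [[0.2, 1.0], [1.1, -0.4], [0.0, 0.5]]
beta = [0.3, -0.2]
print(weibull_loglik_mu_form(y, c, X, beta, 1.4),
      weibull_loglik_direct(y, c, X, beta, 1.4))
```

The two values agree up to rounding. Setting alpha to 1 gives the exponential case, which illustrates how one form of the likelihood can serve several survival distributions.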

For any defined models, the component weights are 
typically estimated using a Bayesian statistical model 
(Kotz and Johnson, 1983) in which a posterior 
distribution of the component weights is formulated which 
combines the likelihood function and a prior 
distribution. The component weights are estimated by 
maximising the posterior distribution of the weights 
given the data generated for each training sample. Thus, 
the objective function to be maximised consists of the
likelihood function based on a model for the feature as 
discussed above and a prior distribution for the weights. 

Preferably, the prior distribution is of the form:

$$p(\beta) = \int_{v^2} p(\beta \mid v^2)\, p(v^2)\, dv^2$$

wherein $v$ is a $p \times 1$ vector of hyperparameters, and where
$p(\beta \mid v^2)$ is $N(0, \mathrm{diag}\{v^2\})$ and $p(v^2)$ is some hyperprior
distribution for $v^2$. This hyperprior distribution (which
is preferably the same for all embodiments of the method)
may be expressed using different notational conventions,
and in the detailed description of the preferred
embodiments (see below), the following notational
conventions are adopted merely for convenience for the
particular preferred embodiment:

As used herein, when the likelihood function for the
probability distribution is based on a multinomial or
binomial logistic regression, the notation for the prior
distribution is:

$$p(\beta_1, \ldots, \beta_{G-1}) = \int_{\tau^2} \prod_{g=1}^{G-1} p(\beta_g \mid \tau_g^2)\, p(\tau_g^2)\, d\tau^2$$

where $\beta^T = (\beta_1^T, \ldots, \beta_{G-1}^T)$ and $\tau^T = (\tau_1^T, \ldots, \tau_{G-1}^T)$,

and $p(\beta_g \mid \tau_g^2)$ is $N(0, \mathrm{diag}\{\tau_g^2\})$ and $p(\tau_g^2)$ is some hyperprior
distribution for $\tau_g^2$.

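As used herein, when the likelihood function for the
probability distribution is based on a categorical
ordered logistic regression, the notation for the prior
distribution is:

$$p(\beta_1^*, \beta_2^*, \ldots, \beta_n^*) = \int_{\tau^2} \prod_{i=1}^{n} p(\beta_i^* \mid \tau_i^2)\, p(\tau_i^2)\, d\tau^2$$

where $\beta_1^*, \beta_2^*, \ldots, \beta_n^*$ are component weights, $p(\beta_i^* \mid \tau_i^2)$ is $N(0, \tau_i^2)$ and
$p(\tau_i^2)$ some hyperprior distribution for $\tau_i^2$.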

As used herein, when the likelihood function for the
distribution is based on a generalised linear model, the
notation for the prior distribution is:

$$p(\beta) = \int_{v^2} p(\beta \mid v^2)\, p(v^2)\, dv^2$$

wherein $v$ is a $p \times 1$ vector of hyperparameters, and where
$p(\beta \mid v^2)$ is $N(0, \mathrm{diag}\{v^2\})$ and $p(v^2)$ is some prior
distribution for $v^2$.

As used herein, when the likelihood function for the
distribution is based on a hazard model, the notation for
the prior distribution is:

$$p(\beta) = \int_{v^2} p(\beta \mid v^2)\, p(v^2)\, dv^2$$

where $p(\beta \mid v^2)$ is $N(0, \mathrm{diag}\{v^2\})$ and $p(v^2)$ some hyperprior
distribution for $v^2$.

The prior distribution comprises a hyperprior that 
ensures that zero weights are preferred whenever 
possible. 

Preferably, the hyperprior is a Jeffreys hyperprior
(Kotz and Johnson, 1983).
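
To see why a hyperprior of this kind prefers zero weights, consider a single weight $\beta$ with variance $v^2$ and, as an assumption made here purely for illustration, a Jeffreys hyperprior written as $p(v^2) \propto 1/v^2$. Integrating out the variance with the substitution $u = \beta^2 / (2v^2)$ gives the marginal prior for the weight:

```latex
\begin{aligned}
p(\beta) &\propto \int_0^\infty \frac{1}{\sqrt{2\pi v^2}}
            \exp\!\left(-\frac{\beta^2}{2v^2}\right)\frac{1}{v^2}\,dv^2 \\
         &= \frac{1}{\sqrt{2\pi}}\cdot\frac{\sqrt{2}}{|\beta|}
            \int_0^\infty u^{-1/2}e^{-u}\,du \\
         &= \frac{1}{\sqrt{2\pi}}\cdot\frac{\sqrt{2}\,\Gamma(1/2)}{|\beta|}
          = \frac{1}{|\beta|}.
\end{aligned}
```

Under this assumption the marginal density grows without bound as $\beta$ approaches zero, which is consistent with a prior having a high probability density close to zero.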

As discussed above, the prior distribution and the 
likelihood function are combined to generate a posterior 
distribution. The posterior distribution is preferably 
of the form:

$$p(\beta, \varphi, v \mid y) \propto L(y \mid \beta, \varphi)\, p(\beta \mid v^2)\, p(v^2)$$

wherein $L(y \mid \beta, \varphi)$ is the likelihood function.

The component weights in the posterior distribution are 
preferably estimated in an iterative procedure such that 
the probability density of the posterior distribution is 
maximised. During the iterative procedure, component 
weights having a value less than a pre-determined
threshold are eliminated, preferably by setting those
component weights to zero. This results in elimination
of the corresponding component.

Preferably, the iterative procedure is an EM algorithm.
The EM algorithm produces a sequence of component weight
estimates that converge to give component weights that
maximise the probability density of the posterior
distribution. The EM algorithm consists of two steps,
known as the E or Expectation step and the M or
Maximisation step. In the E step, the expected value of
the log-posterior function conditional on the observed
data and current parameter values is determined. In the
M step, the expected log-posterior function is maximised
to give updated component weight estimates that increase
the likelihood. The two steps are alternated until
convergence of the E step and the M step is achieved, or
in other words, until the expected value and the
maximised value of the log-posterior function converge.
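
The alternation of E and M steps, together with the elimination of small weights, can be illustrated in a deliberately simple setting. The Python sketch below is not the procedure of the preferred embodiments. It assumes a Gaussian likelihood with one observation y_i per weight beta_i (an identity design), a known noise variance `sigma2`, and a Jeffreys hyperprior p(v^2) proportional to 1/v^2. Under these assumptions the E step reduces to E[1/v_i^2 | beta_i] = 1/beta_i^2 and the M step has a closed form. The names `em_sparse_weights`, `sigma2` and `threshold` are hypothetical.

```python
def em_sparse_weights(y, sigma2=1.0, threshold=1e-8, max_iter=500, tol=1e-12):
    """EM estimate of sparse weights for y_i = beta_i + noise (identity design)."""
    beta = list(y)                            # start from the unpenalised estimate
    for _ in range(max_iter):
        new_beta = []
        for yi, bi in zip(y, beta):
            if bi == 0.0:
                new_beta.append(0.0)          # eliminated components stay eliminated
                continue
            # E step: expected inverse variance given the current weight.
            inv_var = 1.0 / (bi * bi)
            # M step: maximise -(yi - b)^2 / (2 * sigma2) - b^2 * inv_var / 2 over b.
            b = (yi / sigma2) / (1.0 / sigma2 + inv_var)
            # Weights below the threshold are set to zero.
            new_beta.append(b if abs(b) >= threshold else 0.0)
        change = max(abs(a - b) for a, b in zip(new_beta, beta))
        beta = new_beta
        if change < tol:
            break
    return beta

estimates = em_sparse_weights([3.0, 1.0, -4.0, 0.5])
print(estimates)
```

In this toy setting with `sigma2` equal to 1, observations smaller than 2 in absolute value are driven to exactly zero within a few iterations, while the larger ones are shrunk towards zero but retained, which mirrors the rapid elimination of most components described above.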

It is envisaged that the method of the present invention 
may be applied to any system from which measurements can 
be obtained, and preferably systems from which very large 
amounts of data are generated. Examples of systems to 
which the method of the present invention may be applied 
include biological systems, chemical systems, 
agricultural systems, weather systems, financial systems 
including, for example, credit risk assessment systems, 
insurance systems, marketing systems or company record 
systems, electronic systems, physical systems,
astrophysics systems and mechanical systems. For 
example, in a financial system, the samples may be
particular stocks and the components may be measurements
made on any number of factors which may affect stock
prices such as company profits, employee numbers, number
of shareholders etc.

The method of the present invention is particularly 
suitable for use in analysis of biological systems. The
method of the present invention may be used to identify 
subsets of components for classifying samples from any 
biological system which produces measurable values for 
the components and in which the components can be 
uniquely labelled. In other words, the components are 
labelled or organised in a manner which allows data from 
one component to be distinguished from data from another 
component. For example, the components may be spatially 
organised in, for example, an array which allows data 
from each component to be distinguished from another by 
spatial position, or each component may have some unique 
identification associated with it such as an 
identification signal or tag. For example, the
components may be bound to individual carriers, each
carrier having a detectable identification signature such
as quantum dots (see for example, Rosenthal, 2001, Nature
Biotech 19: 621-622; Han et al. (2001) Nature
Biotechnology 19: 631-635), fluorescent markers (see for
example, Fu et al. (1999) Nature Biotechnology 17:
1109-1111), bar-coded tags (see for example, Lockhart and
Trulson (2001) Nature Biotechnology 19: 1122-1123).

In a particularly preferred embodiment, the biological 
system is a biotechnology array. Examples of 
biotechnology arrays (examples of which are described in 
Schena et al., 1995, Science 270: 467-470; Lockhart et
al. 1996, Nature Biotechnology 14: 1649; US Pat No.
5,569,5880) include oligonucleotide arrays, DNA arrays,
DNA microarrays, RNA arrays, RNA microarrays, DNA
microchips, RNA microchips, protein arrays, protein
microchips, antibody arrays, chemical arrays,
carbohydrate arrays, proteomics arrays, lipid arrays. In
another embodiment, the biological system may be selected
from the group including, for example, DNA or RNA
electrophoresis gels, protein or proteomics
electrophoresis gels, biomolecular interaction analysis
such as Biacore analysis, amino acid analysis, ADMETox
screening (see for example High-throughput ADMETox
estimation: In Vitro and In Silico approaches (2002),
Ferenc Darvas and Gyorgy Dorman (Eds), Biotechniques
Press), protein electrophoresis gels and proteomics
electrophoresis gels.

The components may be any measurable component of the
system. In the case of a biological system, the
components may be, for example, genes or portions
thereof, DNA sequences, RNA sequences, peptides,
proteins, carbohydrate molecules, lipids or mixtures
thereof, physiological components, anatomical components,
epidemiological components or chemical components.






The training samples may be any data obtained from a 
system in which the feature of the sample is known. For 
example, training samples may be data generated from a 
sample applied to a biological system. For example, when 
the biological system is a DNA microarray, the training 
sample may be data obtained from the array following 
hybridisation of the array with RNA extracted from cells 
having a known feature, or cDNA synthesised from the RNA 
extracted from cells, or if the biological system is a 
proteomics electrophoresis gel, the training sample may 
be generated from a protein or cell extract applied to 
the system. 

The inventors envisage that the method of the present
invention may be used in one embodiment in re-evaluating
or evaluating test data from subjects who have presented
mixed results in response to a test treatment. Thus, in
a second aspect, the present invention provides a method
for identifying a subset of components of a subject which
are capable of classifying the subject into one of a
plurality of predefined groups wherein each group is
defined by a response to a test treatment comprising the
steps of:

(a) exposing a plurality of subjects to the test
treatment and grouping the subjects into response
groups based on responses to the treatment;

(b) measuring components of the subjects;

(c) identifying a subset of components that is capable
of classifying the subjects into response groups
using a statistical analysis method.

Preferably, the statistical analysis method is a method
according to the first aspect of the invention.

Once a subset of components has been identified, that
subset can be used to classify subjects into groups such
as those that are likely to respond to the test treatment
and those that are not. In this manner, the method of
the present invention permits treatments to be identified
which may be effective for a fraction of the population,
and permits identification of that fraction of the
population that will be responsive to the test treatment.

In a third aspect, the present invention provides an
apparatus for identifying a subset of components of a
subject, the subset being capable of classifying the
subject into one of a plurality of predefined response
groups wherein each response group is formed by exposing
a plurality of subjects to a test treatment and grouping
the subjects into response groups based on the response
to the treatment, the apparatus comprising:

(a) means for receiving measured components of the 
subjects; 

(b) means for identifying a subset of components that is 
capable of classifying the subjects into response 
groups using a statistical analysis method. 

Preferably, the statistical analysis method is the method 
according to the first or second aspect. 

In a fourth aspect, the present invention provides a 
method for identifying a subset of components of a 
subject which are capable of classifying the subject as 
being responsive or non-responsive to treatment with a 
test compound comprising the steps of: 

(a) exposing a plurality of subjects to the compound and
grouping the subjects into response groups based on
each subject's response to the compound;

(b) measuring components of the subjects;

(c) identifying a subset of components that is capable
of classifying the subjects into response groups
using a statistical analysis method.

Preferably, the statistical analysis method is the method
according to the first aspect.

In a fifth aspect, the present invention provides an 
apparatus for identifying a subset of components of a 
subject, the subset being capable of classifying the 
subject into one of a plurality of predefined response 
groups wherein each response group is formed by exposing 
a plurality of subjects to a compound and grouping the 
subjects into response groups based on the response to 
the compound, the apparatus comprising:

(a) means for receiving measured components of the
subjects;

(b) means for identifying a subset of components that is
capable of classifying the subjects into response
groups using a statistical analysis method.

Preferably, the statistical analysis method is the method 
according to the first or second aspect of the invention. 

The components that are measured in the second to fifth 
aspects of the invention may be, for example, genes or 
single nucleotide polymorphisms (SNPs), proteins,
antibodies, carbohydrates, lipids or any other
measurable component of the subject.

In a particularly preferred embodiment, the compound is a 
pharmaceutical compound or a composition comprising a 
pharmaceutical compound and a pharmaceutically acceptable 
carrier. 

The identification method of the present invention may be
implemented by appropriate computer software and
hardware.

In accordance with a sixth aspect, the present invention 
provides an apparatus for identifying a subset of 
components of a system from data generated from the 
system from a plurality of samples from the system, the 
subset being capable of predicting a feature of a test 
sample, the apparatus comprising:

(a) means for generating a linear combination of 
components and component weights in which values for 
each component are introduced from data generated 
from a plurality of training samples, each training 
sample having a known feature; 

(b) means for defining a model for the probability 
distribution of a feature wherein the model is 
conditional on the linear combination and wherein 
the model is not a combination of a binomial 
distribution for a two class response with a probit 
function linking the linear combination and the 
expectation of the response; 

(c) means for constructing a prior distribution for the 
component weights of the linear combination 
comprising a hyperprior having a high probability 
density close to zero; 

(d) means for combining the prior distribution and the 
model to generate a posterior distribution; 

(e) means for identifying a subset of components having 
component weights that maximise the posterior 
distribution. 

The apparatus may comprise an appropriately programmed
computing device. 

In accordance with a seventh aspect, the present 
invention provides a computer program arranged, when 
loaded onto a computing apparatus, to control the 
computing apparatus to implement a method in accordance 
with the first aspect of the present invention.

The computer program may implement any of the preferred
algorithms and method steps of the first or second aspect
of the present invention which are discussed above.

In accordance with an eighth aspect of the present
invention, there is provided a computer readable medium
providing a computer program in accordance with the
fourth aspect of the present invention.

In accordance with a ninth aspect of the present 
invention, there is provided a method of testing a sample 
from a system to identify a feature of the sample, the 
method comprising the steps of testing for a subset of 
components which is diagnostic of the feature, the subset 
of components having been determined by a method in 
accordance with the first or second aspect of the present 
invention. 

Preferably, the system is a biological system. 

In accordance with a tenth aspect of the present
invention, there is provided an apparatus for testing a 
sample from a system to determine a feature of the 
sample, the apparatus including means for testing for 
components identified in accordance with the method of 
the first or second aspect of the present invention. 

In accordance with an eleventh aspect, the present
invention provides a computer program which when run on a
computing device, is arranged to control the computing
device, in a method of identifying components from a
system which are capable of predicting a feature of a
test sample from the system, and wherein a linear
combination of components and component weights is
generated from data generated from a plurality of
training samples, each training sample having a known
feature, and a posterior distribution is generated by
combining a prior distribution for the component weights
comprising a hyperprior having a high probability
density close to zero, and a model that is
conditional on the linear combination wherein the model
is not a combination of a binomial distribution for a two
class response with a probit function linking the linear
combination and the expectation of the response, to
estimate component weights which maximise the posterior
distribution.

Where aspects of the present invention are implemented by 
way of a computing device, it will be appreciated that 
any appropriate computer hardware, e.g. a PC or a
mainframe or a networked computing infrastructure, may be 
used. 

In a twelfth aspect, the present invention provides a
method for identifying a subset of components of a
biological system, the subset being capable of predicting
a feature of a test sample from the biological system,
the method comprising the steps of:

(a) generating a linear combination of components and
component weights in which values for each component
are determined from data generated from a plurality of
training samples, each training sample having a known
feature;

(b) defining a model for the probability distribution
of a feature wherein the model is conditional on the
linear combination;

(c) constructing a prior distribution for the component
weights of the linear combination comprising a
hyperprior having a high probability density close to
zero;

(d) combining the prior distribution and the model to
generate a posterior distribution;

(e) identifying a subset of components having component
weights that maximise the posterior distribution.

BRIEF DESCRIPTION OF THE FIGURES 

Figure 1 illustrates the results of a permutation test on 
prediction success of an embodiment of the present 
invention. Class labels were randomly permuted 200 times, 
and the analysis repeated for each permutation. The 
histogram shows the distribution of prediction success
under permutation. The number of samples that were
correctly classified is shown on the x-axis and the
frequency is shown on the y-axis.

Figure 2 illustrates the results of a permutation test on
prediction success of an embodiment of the present
invention. Class labels were randomly permuted 200 times,
and the analysis repeated for each permutation. The
histogram shows the distribution of prediction success 
under permutation of the class labels. The x-axis is the 
percentage of the total of samples and the y-axis 
(lambda) is the percent of cases correctly classified. 

Figure 3 illustrates a plot of the curve for a 
generalised linear model used in one embodiment of the 
method of the invention. The fitted curve (solid line)
is produced when 5 components selected by the method are
used in the model, the true curve is shown as a dotted
line, and the data (y-axis) from 200 observations (x-axis)
based on the 5 components are shown as circles.

WO 03/034270    PCT/AU02/01417

Figure 4 illustrates a plot of the fitted probabilities 
for a single gene identified using an embodiment of the 
method of the invention. The gene index is shown on the 
x-axis and the probability of the sample belonging to a 
particular ordered class is shown on the y-axis. The 
lines denote classes as follows: dashed line = class 1, 
solid line = class 2, dotted line = class 3, dotted and 
dashed line = class 4. 

Figure 5 is a schematic representation of a personal 
computer used to implement a system according to the 
present invention. 

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS 

The present invention identifies preferably a minimum
number of components which can be used to identify
whether a particular training sample has a particular
feature. The minimum number of components is
"diagnostic" of that feature, or enables discrimination
between samples having a different feature. Essentially,
from all the data which is generated from the system, the
method of the present invention enables identification of
a minimum number of components which can be used to test
for a particular feature. Once those components have
been identified by this method, the components can be
used in future to assess new samples. The method of the
present invention utilises a statistical method to
eliminate components that are not required to correctly
predict the feature.

The inventors have found that component weights of a
linear combination of components of data generated from
the training samples can be estimated in such a way as to
eliminate the components that are not required to
correctly predict the feature of the training sample.
The result is that a subset of components is identified
which can correctly predict the feature of the training
sample. The method of the present invention thus permits
identification, from a large amount of data, of a relatively
small number of components which are capable of correctly
predicting a feature.

The method of the present invention also has the
advantage that it requires less computer memory
than prior art methods which use joint rather than
marginal information on components. Accordingly, the
method of the present invention can be performed rapidly
on computers such as, for example, laptop machines. By
using less memory, the method of the present invention
can also be performed more quickly than
prior art methods which use joint (rather than marginal)
information on components for analysis of, for example,
biological data.

A first embodiment relating to a multiclass logistic 
regression model will now be described. 

A. Multiclass logistic regression model

The method of this embodiment utilises the training 
samples in order to identify a subset of components which 
can classify the training samples into pre-defined 
groups. Subsequently, knowledge of the subset of 
components can be used for tests, for example clinical 
tests, to classify samples into groups such as disease 
classes. For example, a subset of components of a DNA
microarray may be used to group clinical samples into 
clinically relevant classes such as, for example, healthy 
or diseased. 

In this way, the present invention identifies preferably
a minimum number of components which can be used to 
identify whether a particular training sample belongs to 
a particular group. The minimum number of components is 
"diagnostic" of that group, or enables discrimination 
between groups. Essentially, from all the data which is 
generated from the system, the method of the present 
invention enables identification of a minimum number of
components which can be used to test for a particular 
group. Once those components have been identified by 
this method, the components can be used in future to 
classify new samples into the groups. The method of the 
present invention preferably utilises a statistical 
method to eliminate components that are not required to 
correctly identify the group the sample belongs to. 

The samples are grouped into sample groups (or "classes") 
based on a pre-determined classification. The 
classification may be any desired classification by which 
the training samples are to be grouped. For example, the 
classification may be whether the training samples are 
from a leukemia cell or a healthy cell, or that the
training samples are obtained from the blood of patients
having or not having a certain condition, or that the 
training samples are from a cell from one of several 
types of cancer as compared to a normal cell. 

In one embodiment, the input data is organised into an
n x p data matrix $X = (x_{ij})$ with n training samples and p
components. Typically, p will be much greater than n.

In another embodiment, data matrix X may be replaced by
an n x n kernel matrix K to obtain smooth functions of X
as predictors instead of linear predictors. An example
of the kernel matrix K is
$k_{ij} = \exp(-0.5\,(x_i - x_j)^T (x_i - x_j)/\sigma^2)$,
where the subscript on x refers to a row number in the
matrix X. Ideally, subsets of the columns of K are
selected which give sparse representations of these
smooth functions. Further examples of kernel matrices are
given in table 2 below.
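The kernel construction described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `gaussian_kernel_matrix` and the parameter name `sigma` (the scale in the example kernel) are hypothetical, and NumPy is assumed to be available.

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma=1.0):
    """Return the n x n matrix K with k_ij = exp(-0.5 * ||x_i - x_j||^2 / sigma^2)."""
    X = np.asarray(X, dtype=float)
    sq_norms = np.sum(X * X, axis=1)
    # Squared Euclidean distances between all pairs of rows of X.
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative round-off
    return np.exp(-0.5 * sq_dists / sigma**2)

# Three training samples (rows) measured on two components (columns).
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = gaussian_kernel_matrix(X, sigma=1.0)
```

K has one row and one column per training sample, so selecting a subset of its columns amounts to selecting a subset of training samples as basis functions.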

Associated with each sample class (group) may be a class
label $y_i$, where $y_i = k$, $k \in \{1, \ldots, G\}$, which indicates which of G
sample classes a training sample belongs to. We write
the n x 1 vector with elements $y_i$ as y. Given the vector y
we can define indicator variables

\[ e_{ig} = \begin{cases} 1, & y_i = g \\ 0, & \text{otherwise} \end{cases} \tag{1A} \]

In one embodiment, the component weights are estimated
using a Bayesian statistical model (see Kotz and Johnson,
1983). Preferably, the weights are estimated by
maximising the posterior distribution of the weights
given the data generated from each training sample. This
results in an objective function to be maximised
consisting of two parts. The first part is a likelihood
function and the second is a prior distribution for the
weights which ensures that zero weights are preferred
whenever possible. In a preferred embodiment, the
likelihood function is derived from a multiclass logistic
model. Preferably, the likelihood function is computed
from the probabilities:

\[ p_{ig} = \frac{\exp(x_i^T \beta_g)}{1 + \sum_{h=1}^{G-1} \exp(x_i^T \beta_h)}, \quad g = 1, \ldots, G-1 \tag{2A} \]

and

\[ p_{iG} = \frac{1}{1 + \sum_{h=1}^{G-1} \exp(x_i^T \beta_h)} \tag{3A} \]

wherein

$p_{ig}$ is the probability that the training sample with input
data $x_i$ will be in sample class g;

$x_i^T \beta_g$ is a linear combination generated from input data
from training sample i with component weights $\beta_g$;

$x_i^T$ is the components for the i th row of X and $\beta_g$ is a set
of component weights for sample class g.
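As an illustration of equations (2A) and (3A), the following Python sketch computes the class probabilities for all training samples at once. It assumes NumPy; the function name `class_probabilities` and the layout of the weights as a p x (G-1) array `B` are hypothetical choices made for this example only.

```python
import numpy as np

def class_probabilities(X, B):
    """X: n x p data matrix; B: p x (G-1) matrix whose columns are beta_1..beta_{G-1}.
    Returns the n x G matrix of probabilities p_ig, class G being the reference class."""
    eta = X @ B                                          # linear combinations x_i^T beta_g
    eta = np.hstack([eta, np.zeros((X.shape[0], 1))])    # reference class G has exponent 0
    eta -= eta.max(axis=1, keepdims=True)                # stabilise the exponentials
    expo = np.exp(eta)
    return expo / expo.sum(axis=1, keepdims=True)

X = np.array([[1.0, 0.0], [0.0, 1.0]])
B = np.array([[np.log(2.0), 0.0], [0.0, 0.0]])           # G = 3 classes, p = 2 components
P = class_probabilities(X, B)
```

Appending a zero exponent for class G and normalising gives the same values as (2A) and (3A), since the common shift of the exponents cancels in the ratio.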

Typically, as discussed above, the component weights are
estimated in a manner which takes into account the a
priori assumption that most of the component weights are
zero.

In one embodiment, component weights $\beta_g$ in equation (2A)
are estimated in a manner whereby most of the values are
zero, yet the samples can still be accurately classified.

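In one embodiment, the prior specified for the parameters
$\beta_1, \ldots, \beta_{G-1}$ is of the form:

\[ P(\beta_1, \ldots, \beta_{G-1}) = \int_{\tau^2} \prod_{g=1}^{G-1} P(\beta_g \mid \tau_g^2)\, P(\tau_g^2)\, d\tau^2 \tag{4A} \]

where $\beta^T = (\beta_1^T, \ldots, \beta_{G-1}^T)$ and $\tau^T = (\tau_1^T, \ldots, \tau_{G-1}^T)$,
$P(\beta_g \mid \tau_g^2)$ is $N(0, \mathrm{diag}\{\tau_g^2\})$ and
$P(\tau_g^2) \propto \prod_{i=1}^{p} 1/\tau_{ig}^2$ is a Jeffreys
hyperprior, Kotz and Johnson (1983).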

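In one embodiment, the likelihood function is $L(y \mid \beta_1, \ldots, \beta_{G-1})$
of the form in equation (8A) and the posterior
distribution of $\beta$ and $\tau$ given y is

\[ p(\beta, \tau^2 \mid y) \propto L(y \mid \beta)\, p(\beta \mid \tau^2)\, p(\tau^2) \tag{5A} \]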

In one embodiment, the first derivative is determined from
the following equation:

\[ \frac{\partial \log L}{\partial \beta_g} = X^T (e_g - p_g), \quad g = 1, \ldots, G-1 \tag{6A} \]

wherein $e_g^T = (e_{ig}, i = 1, \ldots, n)$ and $p_g^T = (p_{ig}, i = 1, \ldots, n)$ are vectors indicating
membership of sample class g and probability of class g
respectively.

In one embodiment, the second derivative is determined
from the following equation:

\[ \frac{\partial^2 \log L}{\partial \beta_g\, \partial \beta_h} = -X^T \mathrm{diag}\{\delta_{hg}\, p_g - p_h p_g\} X \tag{7A} \]

Equation (6A) and equation (7A) may be derived as follows:
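
(a) Using equations (1A), (2A) and (3A), the likelihood
function of the data can be written as:

\[ L = \prod_{i=1}^{n} \left( \prod_{g=1}^{G-1} \left( \frac{\exp(x_i^T \beta_g)}{1 + \sum_{h=1}^{G-1} \exp(x_i^T \beta_h)} \right)^{e_{ig}} \right) \left( \frac{1}{1 + \sum_{h=1}^{G-1} \exp(x_i^T \beta_h)} \right)^{e_{iG}} \tag{8A} \]

(b) Taking logs of equation (8A) and using the fact that
$\sum_{g=1}^{G} e_{ig} = 1$ for all i gives:

\[ \log L = \sum_{i=1}^{n} \left( \sum_{g=1}^{G-1} e_{ig}\, x_i^T \beta_g - \log\Bigl( 1 + \sum_{g=1}^{G-1} \exp(x_i^T \beta_g) \Bigr) \right) \tag{9A} \]

(c) Differentiating equation (9A) with respect to $\beta_g$
gives

\[ \frac{\partial \log L}{\partial \beta_g} = X^T (e_g - p_g), \quad g = 1, \ldots, G-1 \tag{10A} \]

whereby $e_g^T = (e_{ig}, i = 1, \ldots, n)$ and $p_g^T = (p_{ig}, i = 1, \ldots, n)$ are vectors indicating
membership of sample class g and probability of class g
respectively.

(d) The second derivative of equation (9A) has elements

\[ \frac{\partial^2 \log L}{\partial \beta_g\, \partial \beta_h} = -X^T \mathrm{diag}\{\delta_{hg}\, p_g - p_h p_g\} X \tag{11A} \]

where

\[ \delta_{hg} = \begin{cases} 1, & h = g \\ 0, & \text{otherwise} \end{cases} \]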
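
The derivative expressions for the multiclass logistic log likelihood can be checked numerically. The Python sketch below is an illustration only, assuming NumPy; the helper names (`probs`, `log_lik`, `gradient`, `hessian_block`), the zero-based class indices and the small random data set are hypothetical. The gradient column for class g is X^T (e_g - p_g), and the (g, h) block of the second derivative is -X^T diag{delta_hg p_g - p_h p_g} X.

```python
import numpy as np

def probs(X, B):
    """n x G matrix of class probabilities; class G is the reference class."""
    eta = np.hstack([X @ B, np.zeros((X.shape[0], 1))])
    eta -= eta.max(axis=1, keepdims=True)
    e = np.exp(eta)
    return e / e.sum(axis=1, keepdims=True)

def log_lik(X, E, B):
    """Multinomial log likelihood: sum over i and g of e_ig * log p_ig."""
    return float(np.sum(E * np.log(probs(X, B))))

def gradient(X, E, B):
    """p x (G-1) matrix whose column g is X^T (e_g - p_g)."""
    P = probs(X, B)
    return X.T @ (E[:, :-1] - P[:, :-1])

def hessian_block(X, B, g, h):
    """Block of second derivatives with respect to beta_g and beta_h (zero-based g, h)."""
    P = probs(X, B)
    delta = 1.0 if g == h else 0.0
    d = delta * P[:, g] - P[:, h] * P[:, g]
    return -X.T @ (d[:, None] * X)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))                  # n = 6 samples, p = 3 components
y = np.array([0, 1, 2, 0, 1, 2])             # zero-based class labels, G = 3
E = np.eye(3)[y]                             # indicator variables e_ig
B = rng.normal(size=(3, 2))                  # weights for classes 1..G-1
grad = gradient(X, E, B)
H01 = hessian_block(X, B, 0, 1)
```

Comparing `grad` and `H01` against central finite differences of `log_lik` and `gradient` is a simple way to confirm that an implementation matches the analytic forms.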



Component weights which maximise the posterior
distribution may be determined
using an EM algorithm comprising an E step and an M step.

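Typically, the EM algorithm comprises the steps:

(a) performing an E step by calculating the
conditional expected value of the posterior
distribution of component weights using the
function:

\[ Q = Q(\gamma \mid y, \hat{\gamma}) = \log L - \frac{1}{2} \sum_{g=1}^{G-1} \sum_{i} \left( \frac{\gamma_{ig}}{\hat{\gamma}_{ig}} \right)^2 \tag{12A} \]

where $x_i^T \beta_g = x_i^T P_g \gamma_g$ in equation (8A);

(b) performing an M step by applying an iterative
procedure to maximise Q as a function of $\gamma$
whereby:

\[ \gamma^{t+1} = \gamma^t - \alpha^t \left( \frac{\partial^2 Q}{\partial \gamma^2} \right)^{-1} \left( \frac{\partial Q}{\partial \gamma} \right) \tag{13A} \]

where $\alpha^t$ is a step length such that $0 < \alpha^t \le 1$;
$\beta_g = P_g \gamma_g$,
wherein $P_g$ are matrices of zeroes and ones such that $P_g^T \beta_g$
selects non-zero elements of $\beta_g$; and
$\gamma = (\gamma_g, g = 1, \ldots, G-1)$.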

Equation (12A) may be derived as follows:

Calculate the conditional expected value of (5A) given the
observed data y and a set of parameter estimates $\hat{\beta}$.

Consider the case when components of $\beta$ (and $\hat{\beta}$) are set
to zero i.e. for g = 1, ..., G-1, $\beta_g = P_g \gamma_g$ and $\hat{\beta}_g = P_g \hat{\gamma}_g$, where the
$P_g$ are matrices of zeroes and ones such that $P_g^T \beta_g$
selects the non zero elements of $\beta_g$. In the following we
write $\gamma = (\gamma_g, g = 1, \ldots, G-1)$. Note that the $\gamma_g$ are actually
subsets of the components of $\beta_g$. We use them to keep the
notation as simple as possible.

Ignoring terms not involving $\gamma$ and using (4A), (5A), (9A)
we get:

\[ Q = \log L - \frac{1}{2} \sum_{g=1}^{G-1} \sum_{i} E\left\{ \frac{\gamma_{ig}^2}{\tau_{ig}^2} \,\Big|\, y, \hat{\gamma} \right\} = \log L - \frac{1}{2} \sum_{g=1}^{G-1} \sum_{i} \left( \frac{\gamma_{ig}}{\hat{\gamma}_{ig}} \right)^2 \]

where $x_i^T \beta_g = x_i^T P_g \gamma_g$ in (8A).

Note that the conditional expectation can be evaluated
from first principles given (4A).
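
One way to carry out that evaluation is sketched below for a single component weight. This is a worked illustration rather than text from the specification: it assumes a scalar weight $\beta$ with a normal prior of variance $\tau^2$ and the Jeffreys hyperprior $p(\tau^2) \propto 1/\tau^2$, and the substitution $u = 1/\tau^2$ is introduced only for this derivation.

```latex
% Single weight \beta with \beta \mid \tau^2 \sim N(0, \tau^2) and p(\tau^2) \propto 1/\tau^2.
\begin{align*}
p(\tau^2 \mid \beta) &\propto p(\beta \mid \tau^2)\, p(\tau^2)
  \propto (\tau^2)^{-1/2} \exp\!\left(-\frac{\beta^2}{2\tau^2}\right) (\tau^2)^{-1}, \\
\text{with } u = 1/\tau^2:\quad
p(u \mid \beta) &\propto u^{-1/2} \exp\!\left(-\frac{\beta^2 u}{2}\right),
  \quad\text{i.e. } u \mid \beta \sim \mathrm{Gamma}\!\left(\tfrac{1}{2},\ \text{rate } \tfrac{\beta^2}{2}\right), \\
E\{1/\tau^2 \mid \beta\} &= \frac{1/2}{\beta^2/2} = \frac{1}{\beta^2}.
\end{align*}
```

Evaluated at the current estimate of the weight, this expectation is the reciprocal of the squared estimate, which is why the penalty in the Q function takes the form of a squared ratio of each weight to its current estimate.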

The iterative procedure may be derived as follows: 
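
To obtain the derivatives required in (13A), first note
that from (8A), (9A) and (10A) we get

\[ \frac{\partial Q}{\partial \gamma} = \begin{bmatrix} X_1^T (e_1 - p_1) \\ \vdots \\ X_{G-1}^T (e_{G-1} - p_{G-1}) \end{bmatrix} - \mathrm{diag}\left\{ \frac{1}{\hat{\gamma}^2} \right\} \gamma \tag{15A} \]

and

\[ \frac{\partial^2 Q}{\partial \gamma^2} = - \left\{ \begin{bmatrix} X_1^T \Delta_{11} X_1 & \cdots & X_1^T \Delta_{1,G-1} X_{G-1} \\ \vdots & & \vdots \\ X_{G-1}^T \Delta_{G-1,1} X_1 & \cdots & X_{G-1}^T \Delta_{G-1,G-1} X_{G-1} \end{bmatrix} + \mathrm{diag}\left\{ \frac{1}{\hat{\gamma}^2} \right\} \right\} \tag{16A} \]

where

\[ \Delta_{gh} = \mathrm{diag}\{\delta_{gh}\, p_g - p_g p_h\}, \qquad \delta_{gh} = \begin{cases} 1, & g = h \\ 0, & \text{otherwise} \end{cases} \]

and

\[ X_g^T = P_g^T X^T, \quad g = 1, \ldots, G-1. \tag{17A} \]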
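
In a preferred embodiment, the iterative procedure may be
simplified by using only the block diagonals of equation
(16A) in equation (13A). For g = 1, ..., G-1, this gives:

\[ \gamma_g^{t+1} = \gamma_g^t + \alpha^t \left\{ X_g^T \Delta_{gg} X_g + \mathrm{diag}\{1/\hat{\gamma}_g^2\} \right\}^{-1} \left\{ X_g^T (e_g - p_g) - \mathrm{diag}\{1/\hat{\gamma}_g^2\}\, \gamma_g^t \right\} \tag{18A} \]

Rearranging equation (18A) leads to

\[ \gamma_g^{t+1} = \gamma_g^t + \alpha^t\, \mathrm{diag}\{\hat{\gamma}_g\} \left( Y_g^T \Delta_{gg} Y_g + I \right)^{-1} \left( Y_g^T (e_g - p_g) - \mathrm{diag}\{1/\hat{\gamma}_g\}\, \gamma_g^t \right) \tag{19A} \]

where

\[ Y_g^T = \mathrm{diag}\{\hat{\gamma}_g\} X_g^T \]

Writing p(g) for the number of columns of $Y_g$, (19A)
requires the inversion of a p(g) x p(g) matrix which may be
quite large. This can be reduced to an n x n matrix for
p(g) > n by noting that:

\[ \left( Y_g^T \Delta_{gg} Y_g + I \right)^{-1} = I - Y_g^T \left( Y_g Y_g^T + \Delta_{gg}^{-1} \right)^{-1} Y_g = I - Z_g^T \left( Z_g Z_g^T + I_n \right)^{-1} Z_g \tag{20A} \]

where $Z_g = \Delta_{gg}^{1/2} Y_g$. Preferably, (19A) is used when p(g) ≤ n and
(19A) with (20A) substituted into equation (19A) is used
when p(g) > n.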
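
The saving from replacing a p(g) x p(g) inverse by an n x n one can be illustrated numerically. The Python sketch below is not part of the described method; it assumes NumPy, uses random stand-ins for the matrices (`Y` for the scaled data matrix, `d` for a positive diagonal weight matrix), and checks the matrix inversion identity that underlies the substitution.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p_g = 4, 9                                # fewer samples than retained components
Y = rng.normal(size=(n, p_g))                # stand-in for the n x p(g) scaled data matrix
d = rng.uniform(0.05, 0.25, size=n)          # positive diagonal entries of the weight matrix
Z = np.sqrt(d)[:, None] * Y                  # square root of the weights times Y

# Direct route: invert a p(g) x p(g) matrix.
direct = np.linalg.inv(Y.T @ (d[:, None] * Y) + np.eye(p_g))
# Alternative route: only an n x n system has to be solved.
woodbury = np.eye(p_g) - Z.T @ np.linalg.solve(Z @ Z.T + np.eye(n), Z)
```

Both routes give the same matrix up to rounding error, so the cheaper one can be chosen according to which of n and p(g) is smaller.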

In a preferred embodiment, the EM algorithm is performed 
as follows: 

1. Set n = 0, $P_g = I$ and choose an initial value for $\hat{\gamma}^0$. This
is done by ridge regression of $\log(p_{ig}/p_{iG})$ on $x_i$ where
$p_{ig}$ is chosen to be near one for observations in group g
and a small quantity > 0 otherwise, subject to the
constraint of all probabilities summing to one.

2. Do the E step i.e. evaluate $Q = Q(\gamma \mid y, \hat{\gamma}^n)$.

3. Set t = 0. For g = 1, ..., G-1 calculate:

(a) $\delta_g^t = \gamma_g^{t+1} - \gamma_g^t$ using (19A), with (20A) substituted into
(19A) when p(g) > n.

(b) Writing $\delta^t = (\delta_g^t, g = 1, \ldots, G-1)$, do a line search to find
the value of $\alpha^t$ in $\gamma^{t+1} = \gamma^t + \alpha^t \delta^t$ which maximises (or
simply increases) (12A) as a function of $\alpha^t$.

(c) Set $\gamma^t = \gamma^{t+1}$ and t = t+1.

Repeat steps (a) and (b) until convergence.

This produces $\gamma^{*n+1}$, say, which maximises the current Q
function as a function of $\gamma$.

For g = 1, ..., G-1 determine

\[ S_g = \left\{ j : |\gamma_{jg}^{*n+1}| \le \varepsilon \max_k |\gamma_{kg}^{*n+1}| \right\} \]

where $\varepsilon \ll 1$, say $10^{-5}$. Define $P_g$ so that $\beta_{ig} = 0$ for
$i \in S_g$ and

\[ \hat{\gamma}_g^{n+1} = \left\{ \gamma_{jg}^{*n+1},\ j \notin S_g \right\} \]

This step eliminates variables with small coefficients
from the model.

4. Set n = n+1 and go to 2 until convergence.
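
The elimination step of the algorithm above can be sketched as follows. This Python fragment is an illustration only, assuming NumPy; the function name `prune_small_weights` is hypothetical, and the rule implemented (drop every weight whose magnitude is at most a small fraction of the largest magnitude in the same class) is a reading of the thresholding step under that assumption.

```python
import numpy as np

def prune_small_weights(gamma_star, eps=1e-5):
    """Return (keep_mask, gamma_next) for one class g.

    Weights with |gamma_j| <= eps * max_k |gamma_k| are dropped from the model;
    the remaining weights seed the next E step."""
    gamma_star = np.asarray(gamma_star, dtype=float)
    threshold = eps * np.max(np.abs(gamma_star))
    drop = np.abs(gamma_star) <= threshold       # membership of the eliminated index set
    return ~drop, gamma_star[~drop]

keep, gamma_next = prune_small_weights([0.8, 1e-9, -0.3, 2e-7])
```

The boolean mask plays the role of the selection matrix: applying it to the columns of the data matrix leaves only the components that remain in the model for the next iteration.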

A second embodiment relating to an ordered categorical
logistic regression will now be described.

B. Ordered categories model 

The method of this embodiment may utilise the training
samples in order to identify a subset of components which
can be used to determine whether a test sample belongs to
a particular class. For example, to identify genes for
assessing a tissue biopsy sample using microarray
analysis, microarray data from a series of samples from
tissue that has been previously ordered into classes of
increasing or decreasing disease severity, such as normal
tissue, benign tissue, localised tumour and metastasised
tumour tissue, are used as training samples to identify a
subset of components which is capable of indicating the
severity of disease associated with the training samples.
The subset of components can then be subsequently used to
determine whether previously unclassified test samples
can be classified as normal, benign, localised tumour or
metastasised tumour. Thus, the subset of components is
diagnostic of whether a test sample belongs to a
particular class within an ordered set of classes. It
will be apparent that once the subset of components has
been identified, only the subset of components need be
tested in future diagnostic procedures to determine to
what ordered class a sample belongs.

The method of the invention is particularly suited for
the analysis of very large amounts of data. Typically,
a large data set obtained from test samples is highly
variable and often differs significantly from that
obtained from the training samples. The method of the
present invention is able to identify subsets of
components from a very large amount of data generated
from training samples, and the subset of components
identified by the method can then be used to classify
test samples even when the data generated from the test
sample is highly variable compared to the data generated
from training samples belonging to the same class. Thus,
the method of the invention is able to identify a subset
of components that is more likely to classify a sample
correctly even when the data is of poor quality and/or
there is high variability between samples of the same
ordered class.

The minimum number of components is "predictive" for that 
particular ordered class. Essentially, from all the data 
which is generated from the system, the method of the 
present invention enables identification of a minimum 
number of components which can be used to classify the 
training data. Once those components have been 
identified by this method, the components can be used in 
future to classify test samples. The method of the 
present invention preferably utilises a statistical 
method to eliminate components that are not required to 
correctly classify the sample into a class that is a 
member of an ordered class. 

In the following there are N samples, and vectors such as
y, z and $\mu$ have components $y_i$, $z_i$ and $\mu_i$ for i = 1, ..., N.
Vector multiplication and division is defined component-wise
and diag{ · } denotes a diagonal matrix whose
diagonals are equal to the argument. We also use || · ||
to denote Euclidean norm.

Preferably, there are N observations $y_i$ where $y_i$ takes
integer values 1, ..., G. The values denote classes which are
ordered in some way such as, for example, severity of
disease. Associated with each observation there is a set
of covariates (variables, e.g. gene expression values)
arranged into a matrix X with N rows and p columns
wherein N is the number of samples and p the number of
components. The notation $x_i^T$ denotes the i th row of X.
Individual (sample) i has probabilities of belonging to
class k given by $\pi_{ik}$.

Define cumulative probabilities

\[ \gamma_{ik} = \sum_{g=1}^{k} \pi_{ig}, \quad k = 1, \ldots, G \]

Note that $\gamma_{ik}$ is just the probability that observation i
belongs to a class with index less than or equal to k.
Let C be an N by G matrix with elements $c_{ij}$ given by

\[ c_{ij} = \begin{cases} 1, & \text{if observation } i \text{ in class } j \\ 0, & \text{otherwise} \end{cases} \]

and let R be an N by G matrix with elements $r_{ij}$ given by

\[ r_{ij} = \sum_{g=1}^{j} c_{ig} \]

These are the cumulative sums of the columns of C within
rows.
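
A short Python sketch of the indicator matrix C and its cumulative counterpart R is given below for illustration. It assumes NumPy and class labels coded 1, ..., G; the function name `indicator_and_cumulative` is hypothetical.

```python
import numpy as np

def indicator_and_cumulative(y, G):
    """Build C (c_ij = 1 if observation i is in class j) and R (row-wise cumulative sums of C)."""
    y = np.asarray(y)
    C = np.zeros((y.size, G), dtype=int)
    C[np.arange(y.size), y - 1] = 1      # classes are labelled 1..G
    R = np.cumsum(C, axis=1)             # r_ij = sum of c_ig for g <= j
    return C, R

# Three observations belonging to ordered classes 2, 1 and 3 out of G = 3.
C, R = indicator_and_cumulative([2, 1, 3], G=3)
```

Each row of R switches from 0 to 1 at the observed class and stays at 1 afterwards, which is the form of response used by the ordered categories likelihood.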

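For independent observations (samples) the likelihood of
the data can be written as

\[ L = \prod_{i=1}^{N} \prod_{k=1}^{G-1} \left( \frac{\gamma_{ik}}{\gamma_{i,k+1}} \right)^{r_{ik}} \left( \frac{\gamma_{i,k+1} - \gamma_{ik}}{\gamma_{i,k+1}} \right)^{r_{i,k+1} - r_{ik}} \tag{1B} \]

and the log likelihood l = log(L) can be written as

\[ l = \sum_{i=1}^{N} \sum_{k=1}^{G-1} \left[ r_{ik} \log\left( \frac{\gamma_{ik}}{\gamma_{i,k+1}} \right) + (r_{i,k+1} - r_{ik}) \log\left( \frac{\gamma_{i,k+1} - \gamma_{ik}}{\gamma_{i,k+1}} \right) \right] \tag{2B} \]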
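
The continuation ratio model may be adopted here as
follows:

\[ \mathrm{logit}\left( \frac{\gamma_{ik} - \gamma_{i,k-1}}{\gamma_{ik}} \right) = \mathrm{logit}\left( \frac{\pi_{ik}}{\gamma_{ik}} \right) = \theta_k + x_i^T \beta^* \tag{3B} \]

for k = 2, ..., G, see McCullagh and Nelder (1989) and
McCullagh (1980) and the discussion therein. Note that

\[ \mathrm{logit}\left( \frac{\gamma_{ik} - \gamma_{i,k-1}}{\gamma_{ik}} \right) = -\mathrm{logit}\left( \frac{\gamma_{i,k-1}}{\gamma_{ik}} \right) \tag{4B} \]

The likelihood is equivalent to a logistic regression
likelihood with response vector y and covariate matrix X,
with

\[ y = \mathrm{vec}\{R\} \]

where, in the definition of the covariate matrix, $I_{G-1}$ is
the G-1 by G-1 identity matrix and $1_{G-1}$ is a
G-1 by 1 vector of ones.

Here vec{ } takes the matrix and forms a vector row by
row.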



Typically, as discussed above, the component weights are
estimated in a manner which takes into account the a
priori assumption that most of the component weights are
zero.

Following Figueiredo (2001), in order to eliminate
redundant variables (covariates), a prior is specified for
the parameters $\beta^*$ by introducing a p x 1 vector of
hyperparameters v.

Preferably, the prior specified for the component weights
is of the form

\[ p(\beta^*) = \int_{v^2} p(\beta^* \mid v^2)\, p(v^2)\, dv^2 \tag{5B} \]

where $p(\beta^* \mid v^2)$ is $N(0, \mathrm{diag}\{v^2\})$ and
$p(v^2) \propto \prod_{i=1}^{p} 1/v_i^2$ is a Jeffreys
prior, Kotz and Johnson (1983). The elements of $\theta = (\theta_2, \ldots, \theta_G)^T$
have a non informative prior.

Writing $L(y \mid \beta^*, \theta)$ for the likelihood function, in a
Bayesian framework the posterior distribution of $\beta^*$, $\theta$
and v given y is

\[ p(\beta^*, \theta, v \mid y) \propto L(y \mid \beta^*, \theta)\, p(\beta^* \mid v)\, p(v) \tag{6B} \]

Preferably, by treating v as a vector of missing data, an
iterative algorithm such as an EM algorithm (Dempster et
al, 1977) can be used to maximise (6B) to produce locally
maximum a posteriori estimates of $\beta^*$ and $\theta$. The prior
above is such that the maximum a posteriori estimates will
tend to be sparse i.e. if a large number of parameters are
redundant, many components of $\beta^*$ will be zero.
Preferably $\beta^T = (\theta^T, \beta^{*T})$ in the following and diag() denotes
a diagonal matrix.

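For the ordered categories model above it can be shown
that

\[ \frac{\partial l}{\partial \beta} = X^T (y - \mu) \tag{7B} \]

\[ \frac{\partial^2 l}{\partial \beta^2} = -X^T \mathrm{diag}\{\mu (1 - \mu)\} X \tag{8B} \]

where $x_i^T$ is the i th row of the covariate matrix of the
equivalent logistic regression,
$\mu_i = \exp(x_i^T \beta)/(1 + \exp(x_i^T \beta))$ and $\beta^T = (\theta_2, \ldots, \theta_G, \beta^{*T})$.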

As mentioned above, the component weights which maximise
the posterior distribution may be determined using an
iterative procedure. Preferably, the iterative procedure
for maximising the posterior distribution of the
components and component weights is an EM algorithm, such
as, for example, that described in Dempster et al, 1977.
Preferably, the EM algorithm is performed as follows:

1. Set n = 0, $S_0 = \{1, 2, \ldots, p\}$, $\phi^{(0)}$, and $\varepsilon = 10^{-5}$ (say).
Set the regularisation parameter $\kappa$ at a value much
greater than 1, say 100. This corresponds to adding $1/\kappa^2$
to the first G-1 diagonal elements of the second
derivative matrix in the M step below.

If p ≤ N compute initial values $\beta^*$ by

\[ \beta^* = (X^T X + \lambda I)^{-1} X^T g(y + \zeta) \tag{9B} \]

and if p > N compute initial values $\beta^*$ by

\[ \beta^* = \frac{1}{\lambda} \left( I - X^T (X X^T + \lambda I)^{-1} X \right) X^T g(y + \zeta) \tag{10B} \]

where the ridge parameter $\lambda$ satisfies $0 < \lambda \le 1$ and $\zeta$ is
small and chosen so that the logit link function g is well
defined at $y + \zeta$.

2. Define

\[ \beta_i^{(n)} = \begin{cases} \beta_i^*, & i \in S_n \\ 0, & \text{otherwise} \end{cases} \]

and let $P_n$ be a matrix of zeroes and ones such that the
nonzero elements $\gamma^{(n)}$ of $\beta^{(n)}$ satisfy

\[ \gamma^{(n)} = P_n^T \beta^{(n)}, \qquad \beta^{(n)} = P_n \gamma^{(n)} \]

Define $w_\beta = (w_{\beta i}, i = 1, \ldots, p)$, such that

\[ w_{\beta i} = \begin{cases} 1, & i \ge G \\ 0, & \text{otherwise} \end{cases} \]

and let $w_\gamma = P_n^T w_\beta$.

3. Perform the E step by calculating

\[ Q(\beta \mid \beta^{(n)}) = E\{ \log p(\beta, v \mid y) \mid y, \beta^{(n)} \} = l(y \mid \beta) - 0.5\, \| (\beta \ast w_\beta)/\beta^{(n)} \|^2 \tag{11B} \]

where l is the log likelihood function of y.

Using $\beta = P_n \gamma$ and $\beta^{(n)} = P_n \gamma^{(n)}$, (11B) can be written as

\[ Q(\gamma \mid \gamma^{(n)}) = l(y \mid P_n \gamma) - 0.5\, \| (\gamma \ast w_\gamma)/\gamma^{(n)} \|^2 \tag{12B} \]

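4. Do the M step. This can be done with Newton Raphson
iterations as follows. Set $\gamma_0 = \gamma^{(n)}$ and for r = 0, 1, 2, ...
$\gamma_{r+1} = \gamma_r + \alpha_r \delta_r$ where $\alpha_r$ is chosen by a line search
algorithm to ensure $Q(\gamma_{r+1} \mid \gamma^{(n)}) > Q(\gamma_r \mid \gamma^{(n)})$.

For p ≤ N use

\[ \delta_r = \mathrm{diag}(\gamma^{(n)}) \left[ Y_n^T V_r^{-1} Y_n + I \right]^{-1} \left( Y_n^T z_r - \frac{w_\gamma \gamma_r}{\gamma^{(n)}} \right) \tag{13B} \]

where

\[ Y_n^T = \mathrm{diag}(\gamma^{(n)}) P_n^T X^T, \qquad V_r^{-1} = \mathrm{diag}\{\mu_r (1 - \mu_r)\}, \qquad z_r = y - \mu_r \]

and $\mu_r = \exp(X P_n \gamma_r)/(1 + \exp(X P_n \gamma_r))$.

For p > N use

\[ \delta_r = \mathrm{diag}(\gamma^{(n)}) \left[ I - Y_n^T (Y_n Y_n^T + V_r)^{-1} Y_n \right] \left( Y_n^T z_r - \frac{w_\gamma \gamma_r}{\gamma^{(n)}} \right) \tag{14B} \]

with $V_r$ and $z_r$ defined as before.

Let $\gamma^*$ be the value of $\gamma_r$ when some convergence criterion
is satisfied e.g. $\| \gamma_r - \gamma_{r+1} \| < \varepsilon$ (for example $10^{-5}$).

5. Define $\beta^* = P_n \gamma^*$ and

\[ S_{n+1} = \left\{ i \ge G : |\beta_i^*| > \max_j (|\beta_j^*|\, \varepsilon_1) \right\} \cup \{1, 2, \ldots, G-1\} \]

where $\varepsilon_1$ is a small constant, say 1e-5. Set n = n+1.

6. Check convergence. If $\| \gamma^* - \gamma^{(n)} \| < \varepsilon_2$ where $\varepsilon_2$ is
suitably small then stop, else go to step 2 above.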
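
The steps above can be sketched in Python for the simplest case of two classes, where the ordered categories model reduces to a binary logistic regression with a single intercept. This is a minimal illustration under stated assumptions, not the implementation of the specification: NumPy is assumed, the names `sparse_logistic_em`, `q_value` and `sigmoid` are hypothetical, the offset 0.25 used to start the ridge regression is an arbitrary choice, the intercept is left unscaled (only 1/kappa^2 is added to its diagonal entry), and the line search is a crude step-halving rule.

```python
import numpy as np

def sigmoid(t):
    return np.exp(-np.logaddexp(0.0, -t))           # numerically stable logistic function

def q_value(Xn, y, gam, scale, w):
    eta = Xn @ gam
    loglik = np.sum(y * eta - np.logaddexp(0.0, eta))
    return loglik - 0.5 * np.sum((w * gam / scale) ** 2)   # log likelihood minus penalty

def sparse_logistic_em(X, y, lam=0.5, kappa=100.0, eps=1e-5, max_outer=50):
    N, p = X.shape
    Xa = np.hstack([np.ones((N, 1)), X])            # column 0 carries the intercept
    w_full = np.r_[0.0, np.ones(p)]                 # 1 = penalised weight, 0 = intercept
    # Step 1: ridge start on the logit of a shrunken 0/1 response (offset 0.25 assumed).
    target = np.log((y + 0.25) / (1.25 - y))
    beta = np.linalg.solve(Xa.T @ Xa + lam * np.eye(p + 1), Xa.T @ target)
    active = np.arange(p + 1)                       # initially every component is active
    for _ in range(max_outer):
        Xn, w = Xa[:, active], w_full[active]       # step 2: restrict to the active set
        g_hat = beta[active].copy()                 # current estimates
        scale = np.where(w == 1.0, g_hat, 1.0)      # intercept left unscaled (assumption)
        gam = g_hat.copy()
        Yn = Xn * scale
        for _ in range(100):                        # step 4: Newton-Raphson M step
            mu = sigmoid(Xn @ gam)
            A = Yn.T @ ((mu * (1.0 - mu))[:, None] * Yn) + np.diag(w + (1.0 - w) / kappa**2)
            delta = scale * np.linalg.solve(A, Yn.T @ (y - mu) - w * gam / scale)
            step, q0 = 1.0, q_value(Xn, y, gam, scale, w)   # step 3: Q at the current point
            while step > 1e-8 and q_value(Xn, y, gam + step * delta, scale, w) < q0:
                step *= 0.5                         # crude step-halving line search
            gam = gam + step * delta
            if np.linalg.norm(step * delta) < eps:
                break
        converged = np.linalg.norm(gam - g_hat) < eps       # step 6 test
        beta = np.zeros(p + 1)
        beta[active] = gam
        # Step 5: keep the intercept and the weights that are not negligibly small.
        big = np.abs(beta) > eps * np.max(np.abs(beta[1:]))
        big[0] = True
        active = np.flatnonzero(big)
        beta[~big] = 0.0
        if converged:
            break
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 10))                      # 120 samples, 10 components
y = (rng.uniform(size=120) < sigmoid(2.0 * X[:, 0] - 2.0 * X[:, 1])).astype(float)
beta = sparse_logistic_em(X, y)
```

In this toy data set only the first two components carry signal. Entries of `beta` that are exactly zero correspond to components removed from the model.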

Recovering the probabilities

Once estimates of the parameters $\beta$ have been
obtained, calculate

\[ a_{ik} = \frac{\exp(\theta_k + x_i^T \beta^*)}{1 + \exp(\theta_k + x_i^T \beta^*)} \]

for i = 1, ..., N and k = 2, ..., G.

Preferably, to obtain the probabilities we use the
recursion

\[ \pi_{iG} = a_{iG}, \qquad \pi_{i,k-1} = \frac{a_{i,k-1}}{a_{ik}} (1 - a_{ik})\, \pi_{ik} \]

and the fact that the probabilities sum to one, for i =
1, ..., N.
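
The recovery of class probabilities can be sketched in Python as follows. The sketch assumes NumPy and assumes that each fitted quantity for class k (k = 2, ..., G) is the conditional probability of class k given that the class index is at most k, i.e. the class probability divided by the cumulative probability; the function name `recover_class_probabilities` is hypothetical.

```python
import numpy as np

def recover_class_probabilities(a):
    """a: N x (G-1) array whose columns hold the conditional probabilities for k = 2..G.
    Returns the N x G matrix of class probabilities."""
    a = np.asarray(a, dtype=float)
    N, G = a.shape[0], a.shape[1] + 1
    pi = np.zeros((N, G))
    gamma = np.ones(N)                    # cumulative probability of the top class is 1
    for k in range(G - 1, 0, -1):         # zero-based column k holds class k + 1
        pi[:, k] = a[:, k - 1] * gamma    # class probability = conditional * cumulative
        gamma = gamma - pi[:, k]          # cumulative probability of the class below
    pi[:, 0] = gamma                      # remaining mass: probabilities sum to one
    return pi

# One sample whose true class probabilities are (0.2, 0.3, 0.5).
pi = recover_class_probabilities([[0.6, 0.5]])
```

Working downwards from the top class, each step peels off one class probability, and the lowest class receives whatever mass remains.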

In one embodiment, the covariate matrix X
with rows $x_i^T$ can be replaced by a matrix K with ij th
element $k_{ij}$ and $k_{ij} = k(x_i - x_j)$ for some kernel
function k. This matrix can also be augmented with a
vector of ones. Some example kernels are given in Table 1
below, see Evgeniou et al (1999).



Kernel function                  Formula for k(x - y)

Gaussian radial basis function   $\exp(-\|x - y\|^2 / a)$, $a > 0$
Inverse multiquadric             $(\|x - y\|^2 + c^2)^{-1/2}$
Multiquadric                     $(\|x - y\|^2 + c^2)^{1/2}$
Thin plate splines               $\|x - y\|^{2n+1}$
                                 $\|x - y\|^{2n} \ln(\|x - y\|)$
Multi layer perceptron           $\tanh(x \cdot y - \theta)$, for suitable $\theta$
Polynomial of degree d           $(1 + x \cdot y)^d$
B splines                        $B_{2n+1}(x - y)$
Trigonometric polynomials        $\sin((d + 1/2)(x - y)) / \sin((x - y)/2)$

Table 1: Examples of kernel functions

In Table 1 the last two kernels are preferably one
dimensional i.e. for the case when X has only one column.
Multivariate versions can be derived from products of
these kernel functions. The definition of $B_{2n+1}$ can be
found in De Boor (1978). Use of a kernel function
results in estimated probabilities which are smooth (as
opposed to transforms of linear) functions of the
covariates X. Such models may give a substantially
better fit to the data.
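
For illustration, a few of the kernels of Table 1 can be written in Python and used to build a kernel matrix augmented with a vector of ones. The sketch assumes NumPy; the dictionary `kernels`, the helper `pairwise` and the default parameter values are hypothetical choices for this example.

```python
import numpy as np

def pairwise(kernel, X):
    """Evaluate a kernel on every pair of rows of X and return the n x n matrix K."""
    n = X.shape[0]
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

kernels = {
    "gaussian_rbf": lambda x, y, a=1.0: np.exp(-np.sum((x - y) ** 2) / a),
    "inverse_multiquadric": lambda x, y, c=1.0: (np.sum((x - y) ** 2) + c**2) ** -0.5,
    "multiquadric": lambda x, y, c=1.0: (np.sum((x - y) ** 2) + c**2) ** 0.5,
    "polynomial": lambda x, y, d=2: (1.0 + x @ y) ** d,
}

X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]])
K = pairwise(kernels["polynomial"], X)               # polynomial kernel of degree 2
K_aug = np.hstack([np.ones((X.shape[0], 1)), K])     # augmented with a vector of ones
K_rbf = pairwise(kernels["gaussian_rbf"], X)
```

The augmented matrix can then take the place of the covariate matrix in the estimation procedure, with the leading column of ones acting as a constant term.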

A third embodiment relating to a generalised linear model 
will now be described. 

C. Generalised Linear Models 

The method of this embodiment utilises the training 
samples in order to identify a subset of components which 
can predict the characteristic of a sample. 
Subsequently, knowledge of the subset of components can 
be used for tests, for example clinical tests, to predict
unknown values of the characteristic of interest. For 
example, a subset of components of a DNA microarray may 
be used to predict a clinically relevant characteristic 
such as, for example, a blood glucose level, a white 
blood cell count, the size of a tumour, tumour growth 
rate or survival time. 

In this way, the present invention identifies preferably 
a minimum number of components which can be used to 
predict a characteristic for a particular sample. The 
minimum number of components is "predictive" for that 
characteristic. Essentially, from all the data which is 
generated from the system, the method of the present 
invention enables identification of a minimum number of 
components which can be used to predict a particular 
characteristic. Once those components have been 
identified by this method, the components can be used in 
future to predict the characteristic for new samples. 
The method of the present invention preferably utilises a
statistical method to eliminate components that are not
required to correctly predict the characteristic for the
sample.

The inventors have found that component weights of a 
linear combination of components of data generated from 
the training samples can be estimated in such a way as to 
eliminate the components that are not required to predict 
a characteristic for a training sample. The result is 
that a subset of components is identified which can
correctly predict the characteristic for samples in the 
training set. The method of the present invention thus 
permits identification from a large amount of data a 
relatively small number of components which are capable 
of correctly predicting a characteristic for a training 
sample, for example, a quantity of interest. 

The characteristic may be any characteristic of interest.
In one embodiment, the characteristic is a quantity or
measure. In another embodiment, it may be the index
number of a group, where the samples are grouped into two
sample groups (or "classes") based on a pre-determined
classification. The classification may be any desired
classification by which the training samples are to be
grouped. For example, the classification may be whether
the training samples are from a leukemia cell or a
healthy cell, or that the training samples are obtained
from the blood of patients having or not having a certain
condition, or that the training samples are from a cell
from one of several types of cancer as compared to a
normal cell. In another embodiment the characteristic may
be a censored survival time, indicating that particular
patients have survived for at least a given number of
days. In other embodiments the quantity may be any
continuously variable characteristic of the sample which
is capable of measurement, for example blood pressure.

In one embodiment, the data may be a quantity $y_i$, where
$i \in \{1, \ldots, N\}$. We write the $N \times 1$ vector with
elements $y_i$ as $y$.

We define a $p \times 1$ parameter vector $\beta$ of component
weights (many of which are expected to be zero), and a
$q \times 1$ vector of parameters $\varphi$ (not expected to be
zero). Note that q could be zero (i.e. the set of
parameters not expected to be zero may be empty).

In one embodiment, the input data is organised into an
$N \times p$ data matrix $X = \{x_{ij}\}$ with N test training
samples and p components. Typically, p will be much
greater than N.

In another embodiment, data matrix X may be replaced by
an $N \times N$ kernel matrix K to obtain smooth functions of X
as predictors instead of linear predictors. An example
of the kernel matrix K is
$k_{ij} = \exp(-0.5\,(x_i - x_j)^T (x_i - x_j)/\sigma^2)$,
where the subscript on x refers to a row number in the
matrix X. Ideally, subsets of the columns of K are
selected which give sparse representations of these
smooth functions.
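
The kernel matrix construction can be sketched in a few lines. This is a minimal illustration only: the function name, the list-of-rows data layout and the toy data are assumptions made for the example, and the width parameter is written `sigma` to match the reading of the formula given above.

```python
import math

def gaussian_kernel_matrix(X, sigma):
    """N x N matrix with k_ij = exp(-0.5 * (x_i - x_j)^T (x_i - x_j) / sigma^2)."""
    n = len(X)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            # Squared Euclidean distance between rows i and j of X.
            sq_dist = sum((a - b) ** 2 for a, b in zip(X[i], X[j]))
            K[i][j] = K[j][i] = math.exp(-0.5 * sq_dist / sigma ** 2)
    return K

# Three samples with two components each (made-up data).
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
K = gaussian_kernel_matrix(X, sigma=1.0)
```

Each column of K then plays the role that a column of X plays in the linear case, so selecting a sparse set of columns of K selects a small number of "centre" samples.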

Typically, as discussed above, the component weights are 
estimated in a manner which takes into account the a 
priori assumption that most of the component weights are 
zero.

In one embodiment, the prior specified for the component
weights is of the form:

    $p(\beta) = \int p(\beta \mid v^2)\, p(v^2)\, dv^2$    (1C)

where $p(\beta \mid v^2)$ is $N(0, \mathrm{diag}\{v^2\})$ and
$p(v^2) \propto \prod_{i=1}^{n} 1/v_i^2$ is a Jeffreys prior,
Kotz and Johnson (1983). Preferably, an uninformative
prior for $\varphi$ is specified.
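
As a worked step that is not part of the original text, integrating the variance of a single weight out of this prior shows why it places high density close to zero. Writing $u = v_i^2$ and substituting $t = \beta_i^2/(2u)$:

```latex
\begin{aligned}
p(\beta_i) &\propto \int_0^\infty (2\pi u)^{-1/2}
   \exp\!\left(-\frac{\beta_i^2}{2u}\right)\frac{1}{u}\,du \\
 &= \frac{1}{\sqrt{2\pi}}\cdot\frac{\sqrt{2}}{|\beta_i|}
   \int_0^\infty t^{-1/2}e^{-t}\,dt
  = \frac{\sqrt{2}\,\Gamma(1/2)}{\sqrt{2\pi}\,|\beta_i|}
  = \frac{1}{|\beta_i|}, \\[4pt]
p(1/u \mid \beta_i) &\propto (1/u)^{-1/2}
   \exp\!\left(-\tfrac{1}{2}\beta_i^2\,(1/u)\right)
 \quad\Longrightarrow\quad
 E\{1/v_i^2 \mid \beta_i\} = \frac{1/2}{\beta_i^2/2} = \frac{1}{\beta_i^2}.
\end{aligned}
```

The first line gives an (improper) marginal prior proportional to $1/|\beta_i|$, which is unbounded at zero. The second line uses the fact that $1/v_i^2$ given $\beta_i$ follows a Gamma distribution with shape 1/2 and rate $\beta_i^2/2$; its mean is the weight that an expectation over the missing variances attaches to $\beta_i^2$.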

The likelihood function defines a model which fits the
data based on the distribution of the data. Preferably,
the likelihood function is derived from a generalised
linear model. For example, the likelihood function
$L(y \mid \beta, \varphi)$ may be the form appropriate for a
generalised linear model (GLM), such as for example, that
described by Nelder and Wedderburn (1972). Preferably,
the likelihood function is of the form:

    $l = \log p(y \mid \beta, \varphi) = \sum_{i=1}^{N} \left\{ \frac{y_i \theta_i - b(\theta_i)}{a_i(\varphi)} + c(y_i, \varphi) \right\}$    (2C)

where $y = (y_1, \ldots, y_N)^T$ and $a_i(\varphi) = \varphi / w_i$ with the
$w_i$ being a fixed set of known weights and $\varphi$ a single
scale parameter.

Preferably, the likelihood function is specified as
follows. We have

    $E\{y_i\} = b'(\theta_i)$
    $\mathrm{Var}\{y_i\} = b''(\theta_i)\, a_i(\varphi) = \tau_i^2\, a_i(\varphi)$    (3C)

Each observation has a set of covariates $x_i$ and a linear
predictor $\eta_i = x_i^T \beta$. The relationship between the
mean of the i-th observation and its linear predictor is
given by



WO 03/034270 



PCT/AU02/01417 




the link function $\eta_i = g(\mu_i) = g(b'(\theta_i))$. The
inverse of the link is denoted by h, i.e.
$\mu_i = b'(\theta_i) = h(\eta_i)$.

In addition to the scale parameter, a generalised linear 
model may be specified by four components : 

• the likelihood or (scaled) deviance function, 

• the link function 

• the derivative of the link function 

• the variance function. 



Some common examples of generalised linear models are 
given in table 2 below. 
Table 2

Distribution      Link function          Derivative of         Variance          Scale
                  $g(\mu)$               link function         function          parameter

Gaussian          $\mu$                  1                     1                 yes

Binomial with     $\log(\mu/(n-\mu))$    $n/(\mu(n-\mu))$      $\mu(n-\mu)/n$    no
n trials

Poisson           $\log(\mu)$            $1/\mu$               $\mu$             no

Gamma             $1/\mu$                $-1/\mu^2$            $\mu^2$           yes

Inverse           $1/\mu^2$              $-2/\mu^3$            $\mu^3$           yes
Gaussian
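
The rows of such a table map directly onto a small lookup structure. The sketch below is illustrative only: the dictionary layout and function names are assumptions for the example, the binomial entries assume n trials with mean $\mu$ strictly between 0 and n, and the final lines simply check each stated derivative against a central finite difference.

```python
import math

def make_families(n_trials):
    """Map each distribution to (link g, derivative dg/dmu, variance function)."""
    n = n_trials
    return {
        "gaussian": (lambda mu: mu, lambda mu: 1.0, lambda mu: 1.0),
        "binomial": (lambda mu: math.log(mu / (n - mu)),
                     lambda mu: n / (mu * (n - mu)),
                     lambda mu: mu * (n - mu) / n),
        "poisson": (math.log, lambda mu: 1.0 / mu, lambda mu: mu),
        "gamma": (lambda mu: 1.0 / mu,
                  lambda mu: -1.0 / mu ** 2,
                  lambda mu: mu ** 2),
        "inverse_gaussian": (lambda mu: 1.0 / mu ** 2,
                             lambda mu: -2.0 / mu ** 3,
                             lambda mu: mu ** 3),
    }

def numeric_derivative(f, x, h=1e-6):
    """Central finite difference, used only as a sanity check."""
    return (f(x + h) - f(x - h)) / (2 * h)

families = make_families(n_trials=10)
# Largest gap between the tabulated derivative and the numerical one at mu = 2.5.
max_error = max(abs(numeric_derivative(g, 2.5) - dg(2.5))
                for g, dg, _ in families.values())
```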



In another embodiment, the likelihood function is derived
from a multiclass logistic model.

In another embodiment, a quasi likelihood model is
specified wherein only the link function and variance
function are defined. In some instances, such
specification results in the models in the table above.
In other instances, no distribution is specified.

In one embodiment, the posterior distribution of $\beta$,
$\varphi$ and $v$ given $y$ is estimated using:

    $p(\beta, \varphi, v \mid y) \propto L(y \mid \beta, \varphi)\, p(\beta \mid v)\, p(v)$    (4C)

wherein $L(y \mid \beta, \varphi)$ is the likelihood function.

In one embodiment, $v$ may be treated as a vector of
missing data and an iterative procedure used to maximise
equation (2C) to produce locally maximum a posteriori
estimates of $\beta$. The prior of equation (5C) is such that
the maximum a posteriori estimates will tend to be sparse
i.e. if a large number of parameters are redundant, many
components of $\beta$ will be zero.

As stated above, the component weights which maximise the
posterior distribution may be determined using an
iterative procedure. Preferably, the iterative procedure
for maximising the posterior distribution of the
components and component weights is an EM algorithm, such
as, for example, that described in Dempster et al, 1977.

In one embodiment, the EM algorithm comprises the steps:

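(a) Initialising the algorithm by setting n = 0,
    $S_0 = \{1, 2, \ldots, p\}$, initialising $\varphi^{(0)}$, $\beta^*$ and
    applying a value for $\varepsilon$, such as for example
    $\varepsilon = 10^{-5}$;

(b) Defining

    $\beta_i^{(n)} = \beta_i^*$ if $i \in S_n$, and 0 otherwise    (5C)

    and let $P_n$ be a matrix of zeroes and ones such
    that the nonzero elements $\gamma^{(n)}$ of $\beta^{(n)}$ satisfy

    $\gamma^{(n)} = P_n^T \beta^{(n)}$, $\beta^{(n)} = P_n \gamma^{(n)}$
    $\gamma = P_n^T \beta$, $\beta = P_n \gamma$

(c) performing an estimation (E) step by
    calculating the conditional expected value of
    the posterior distribution of component weights
    using the function:

    $Q(\beta \mid \beta^{(n)}, \varphi^{(n)}) = E\{\log p(\beta, \varphi, v \mid y) \mid y, \beta^{(n)}, \varphi^{(n)}\}$
    $= l(y \mid \beta, \varphi^{(n)}) - 0.5\, \|\beta / \beta^{(n)}\|^2$    (6C)

    where l is the log likelihood function of y.
    Using $\beta = P_n \gamma$ and $\beta^{(n)} = P_n \gamma^{(n)}$, (6C) can be
    written as

    $Q(\gamma \mid \gamma^{(n)}, \varphi^{(n)}) = l(y \mid P_n \gamma, \varphi^{(n)}) - 0.5\, \|\gamma / \gamma^{(n)}\|^2$    (7C)

(d) performing a maximisation (M) step by applying
    an iterative procedure to maximise Q as a
    function of $\gamma$ whereby $\gamma_0 = \gamma^{(n)}$ and for
    r = 0, 1, 2, ... $\gamma_{r+1} = \gamma_r + \alpha_r \delta_r$, where
    $\alpha_r$ is chosen by a line search algorithm to ensure
    $Q(\gamma_{r+1} \mid \gamma^{(n)}, \varphi^{(n)}) > Q(\gamma_r \mid \gamma^{(n)}, \varphi^{(n)})$, and

    $\delta_r = \mathrm{diag}(\gamma^{(n)}) \left[ -\mathrm{diag}(\gamma^{(n)}) \frac{\partial^2 l}{\partial \gamma_r^2} \mathrm{diag}(\gamma^{(n)}) + I \right]^{-1} \left( \mathrm{diag}(\gamma^{(n)}) \frac{\partial l}{\partial \gamma_r} - \frac{\gamma_r}{\gamma^{(n)}} \right)$    (8C)

    where:

    $\frac{\partial l}{\partial \gamma_r} = P_n^T \frac{\partial l}{\partial \beta_r}$, $\frac{\partial^2 l}{\partial \gamma_r^2} = P_n^T \frac{\partial^2 l}{\partial \beta_r^2} P_n$    (9C)

    for $\beta_r = P_n \gamma_r$.

    Let $\gamma^*$ be the value of $\gamma_r$ when some convergence
    criterion is satisfied, for example,
    $\|\gamma_r - \gamma_{r+1}\| < \varepsilon$ (for example $10^{-5}$);

(e) Defining $\beta^* = P_n \gamma^*$,
    $S_{n+1} = \{i : |\beta_i^*| > \max_j(|\beta_j^*|\, \varepsilon_1)\}$
    where $\varepsilon_1$ is a small constant, for example 1e-5;

(f) Set n = n+1 and choose
    $\varphi^{(n+1)} = \varphi^{(n)} + \kappa_n (\varphi^* - \varphi^{(n)})$ where $\varphi^*$
    satisfies $\frac{\partial}{\partial \varphi} l(y \mid P_n \gamma^*, \varphi) = 0$ and $\kappa_n$
    is a damping factor such that $0 < \kappa_n < 1$; and

(g) Check convergence. If $\|\gamma^* - \gamma^{(n)}\| < \varepsilon_2$ where
    $\varepsilon_2$ is suitably small then stop, else go to step
    (b) above.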
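
To make the interplay of the E and M steps concrete, the sketch below works through a deliberately simplified special case. The simplifications are assumptions of this example and not the embodiment itself: a Gaussian model with identity link, a known scale `phi`, and orthonormal columns of X, so that each weight can be updated separately. In that case the Q function for one weight is $-(\beta - b)^2/(2\varphi) - 0.5\,\beta^2/(\beta^{(n)})^2$, where b is the least squares value, and its maximiser is available in closed form, so no Newton or line search step is needed.

```python
def em_update(beta_n, b, phi):
    """One combined E and M step for a single weight.

    Maximises -(beta - b)^2 / (2 * phi) - 0.5 * beta^2 / beta_n^2 over beta.
    """
    return b * beta_n ** 2 / (beta_n ** 2 + phi)

def run_em(b, phi, tol=1e-12, max_iter=10_000):
    """Iterate the update from the least squares value until it stabilises."""
    beta = b
    for _ in range(max_iter):
        new_beta = em_update(beta, b, phi)
        if abs(new_beta - beta) < tol:
            return new_beta
        beta = new_beta
    return beta

weak = run_em(b=0.5, phi=1.0)    # small least squares value
strong = run_em(b=5.0, phi=1.0)  # large least squares value
```

In this toy setting a nonzero fixed point exists only when $b^2 \ge 4\varphi$, so the weak weight is driven to zero while the strong one is shrunk slightly and retained. That is the sparsity behaviour the prior is designed to produce.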

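In another embodiment, step (d) in the maximisation step
may be estimated by replacing $\frac{\partial^2 l}{\partial \gamma_r^2}$ with its
expectation $E\{\frac{\partial^2 l}{\partial \gamma_r^2}\}$. This is preferred when the
model of the data is a generalised linear model.

For generalised linear models the expected value
$E\{\frac{\partial^2 l}{\partial \gamma_r^2}\}$ may be calculated as follows:

    $\frac{\partial l}{\partial \beta} = X^T \left\{ \frac{1}{\tau_i^2} \frac{\partial \mu_i}{\partial \eta_i} \frac{y_i - \mu_i}{a_i(\varphi)} \right\}$    (10C)

where X is the N by p matrix with i-th row $x_i^T$ and

    $E\left\{ \frac{\partial^2 l}{\partial \beta^2} \right\} = -X^T\, \mathrm{diag}\!\left( a_i(\varphi)\, \tau_i^2 \left( \frac{\partial \eta_i}{\partial \mu_i} \right)^2 \right)^{-1} X$    (11C)

This can be written as

    $\frac{\partial l}{\partial \beta} = X^T V^{-1} \left\{ (y_i - \mu_i) \frac{\partial \eta_i}{\partial \mu_i} \right\}$    (12C)

    $E\left\{ \frac{\partial^2 l}{\partial \beta^2} \right\} = -X^T V^{-1} X$    (13C)

where $V = \mathrm{diag}\!\left( a_i(\varphi)\, \tau_i^2 \left( \frac{\partial \eta_i}{\partial \mu_i} \right)^2 \right)$.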
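
As an illustration of how V and the expected second derivative are assembled, the sketch below evaluates them for a Poisson model with log link. The function names, the list-of-rows layout and the toy data are assumptions made for the example; only the formulas for V and for the expected second derivative come from the text.

```python
import math

def v_diagonal(mu, a, variance_fn, dlink_fn):
    """Diagonal of V: a_i(phi) * tau_i^2 * (d eta_i / d mu_i)^2."""
    return [a_i * variance_fn(m) * dlink_fn(m) ** 2 for m, a_i in zip(mu, a)]

def expected_hessian(X, v_diag):
    """Expected second derivative -X^T V^{-1} X for an N x p list-of-rows X."""
    p = len(X[0])
    return [[-sum(row[j] * row[k] / v for row, v in zip(X, v_diag))
             for k in range(p)] for j in range(p)]

# Poisson with log link: variance function mu, d eta / d mu = 1 / mu, a_i = 1.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]  # made-up covariates
beta = [0.1, 0.3]                         # made-up component weights
mu = [math.exp(sum(x * b for x, b in zip(row, beta))) for row in X]
V = v_diagonal(mu, [1.0] * len(X), lambda m: m, lambda m: 1.0 / m)
H = expected_hessian(X, V)
```

For this family each diagonal entry of V reduces to $1/\mu_i$, so the result is the familiar $-X^T \mathrm{diag}(\mu) X$.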

Preferably, the EM algorithm comprises the steps:

(a) Initialising the algorithm by setting n = 0,
    $S_0 = \{1, 2, \ldots, p\}$, $\varphi^{(0)}$, applying a value for
    $\varepsilon$, such as for example $\varepsilon = 10^{-5}$, and

    If $p \le N$ compute initial values $\beta^*$ by

    $\beta^* = (X^T X + \lambda I)^{-1} X^T g(y + \zeta)$    (14C)

    and if $p > N$ compute initial values $\beta^*$ by

    $\beta^* = \frac{1}{\lambda} \left( I - X^T (X X^T + \lambda I)^{-1} X \right) X^T g(y + \zeta)$    (15C)

    where the ridge parameter $\lambda$ satisfies $0 < \lambda \le 1$ and
    $\zeta$ is small and chosen so that the link function g
    is well defined at $y + \zeta$.

(b) Defining

    $\beta_i^{(n)} = \beta_i^*$ if $i \in S_n$, and 0 otherwise

    and let $P_n$ be a matrix of zeroes and ones such
    that the nonzero elements $\gamma^{(n)}$ of $\beta^{(n)}$ satisfy

    $\gamma^{(n)} = P_n^T \beta^{(n)}$, $\beta^{(n)} = P_n \gamma^{(n)}$
    $\gamma = P_n^T \beta$, $\beta = P_n \gamma$

(c) performing an estimation (E) step by calculating
    the conditional expected value of the posterior
    distribution of component weights using the
    function:

    $Q(\beta \mid \beta^{(n)}, \varphi^{(n)}) = E\{\log p(\beta, \varphi, v \mid y) \mid y, \beta^{(n)}, \varphi^{(n)}\}$
    $= l(y \mid \beta, \varphi^{(n)}) - 0.5\, \|\beta / \beta^{(n)}\|^2$    (16C)

    where l is the log likelihood function of
    y. Using $\beta = P_n \gamma$ and $\beta^{(n)} = P_n \gamma^{(n)}$, (16C) can be
    written as

    $Q(\gamma \mid \gamma^{(n)}, \varphi^{(n)}) = l(y \mid P_n \gamma, \varphi^{(n)}) - 0.5\, \|\gamma / \gamma^{(n)}\|^2$    (17C)

(d) performing a maximisation (M) step by applying
    an iterative procedure, for example a Newton
    Raphson iteration, to maximise Q as a function
    of $\gamma$ whereby $\gamma_0 = \gamma^{(n)}$ and for r = 0, 1, 2, ...
    $\gamma_{r+1} = \gamma_r + \alpha_r \delta_r$ where $\alpha_r$ is chosen by a line
    search algorithm to ensure
    $Q(\gamma_{r+1} \mid \gamma^{(n)}, \varphi^{(n)}) > Q(\gamma_r \mid \gamma^{(n)}, \varphi^{(n)})$, and

    For $p \le N$ use

    $\delta_r = \mathrm{diag}(\gamma^{(n)}) \left[ Y_n^T V_r^{-1} Y_n + I \right]^{-1} \left( Y_n^T V_r^{-1} z_r - \frac{\gamma_r}{\gamma^{(n)}} \right)$    (18C)

    where

    $Y_n^T = \mathrm{diag}(\gamma^{(n)})\, P_n^T X^T$
    $V = \mathrm{diag}\!\left( a_i(\varphi)\, \tau_i^2 \left( \frac{\partial \eta_i}{\partial \mu_i} \right)^2 \right)$

    and the subscript r denotes that these quantities
    are evaluated at $\mu = h(X P_n \gamma_r)$.

    For $p > N$ use

    $\delta_r = \mathrm{diag}(\gamma^{(n)}) \left[ I - Y_n^T (Y_n Y_n^T + V_r)^{-1} Y_n \right] \left( Y_n^T V_r^{-1} z_r - \frac{\gamma_r}{\gamma^{(n)}} \right)$    (19C)

    with $V_r$ and $z_r$ defined as before.

    Let $\gamma^*$ be the value of $\gamma_r$ when some convergence
    criterion is satisfied e.g.
    $\|\gamma_r - \gamma_{r+1}\| < \varepsilon$ (for example $10^{-5}$).

(e) Define $\beta^* = P_n \gamma^*$,
    $S_{n+1} = \{i : |\beta_i^*| > \max_j(|\beta_j^*|\, \varepsilon_1)\}$ where $\varepsilon_1$ is
    a small constant, say 1e-5. Set n = n+1 and choose
    $\varphi^{(n+1)} = \varphi^{(n)} + \kappa_n (\varphi^* - \varphi^{(n)})$ where $\varphi^*$ satisfies
    $\frac{\partial}{\partial \varphi} l(y \mid P_n \gamma^*, \varphi) = 0$ and $\kappa_n$ is a damping
    factor such that $0 < \kappa_n < 1$. Note that in some
    cases the scale parameter is known or this equation
    can be solved explicitly to get an updating equation
    for $\varphi$.

The above embodiments may be extended to incorporate
quasi likelihood methods (Wedderburn (1974) and McCullagh
and Nelder (1983)). In such an embodiment, the same
iterative procedure as detailed above will be
appropriate, but with the likelihood replaced by a
quasi likelihood as shown above and, for example, Table
8.1 in McCullagh and Nelder (1983). In one embodiment
there is a modified updating method for the scale
parameter $\varphi$. To define these models requires
specification of the variance function $\tau^2$, the link
function g and the derivative of the link function
$\frac{\partial \eta}{\partial \mu}$.

Once these are defined the above algorithm can be
applied. In one embodiment for quasi likelihood models,
step 5 of the above algorithm is modified so that the
scale parameter is updated by calculating

where $\mu$ and $\tau$ are evaluated at $\beta^* = P_n \gamma^*$. Preferably,
this updating is performed when the number of parameters
s in the model is less than N. A divisor of N-s can be
used when s is much less than N.
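
The updating formula itself did not survive in the text above. One standard choice that is consistent with the surrounding description (it depends on $\mu$ and $\tau$, and admits a divisor of either N or N-s) is a Pearson-type estimate, sketched below. This is an assumption offered for illustration, not a reconstruction of the original equation; the function name, argument names and toy data are likewise hypothetical.

```python
def pearson_scale(y, mu, variance_fn, n_params=0):
    """Pearson-type scale estimate: sum((y - mu)^2 / tau^2(mu)) / (N - n_params).

    n_params=0 gives a divisor of N; n_params=s gives a divisor of N - s.
    """
    n = len(y)
    if n - n_params <= 0:
        raise ValueError("divisor must be positive: need N greater than n_params")
    chi2 = sum((yi - mi) ** 2 / variance_fn(mi) for yi, mi in zip(y, mu))
    return chi2 / (n - n_params)

y = [2.0, 0.0, 5.0, 3.0]    # made-up observations
mu = [1.0, 1.0, 4.0, 4.0]   # made-up fitted means
phi_n = pearson_scale(y, mu, variance_fn=lambda m: m)               # divisor N
phi_ns = pearson_scale(y, mu, variance_fn=lambda m: m, n_params=2)  # divisor N - s
```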

In another embodiment, for both generalised linear models
and quasi likelihood models the covariate matrix X with
rows $x_i^T$ can be replaced by a matrix K with ij-th element
$k_{ij}$ and $k_{ij} = \kappa(x_i - x_j)$ for some kernel function
$\kappa$. This matrix can also be augmented with a vector of
ones. Some example kernels are given in Table 3 below,
see Evgeniou et al (1999).



Kernel function                  Formula for $k(x - y)$

Gaussian radial basis function   $\exp(-\|x - y\|^2 / a)$, $a > 0$

Inverse multiquadric             $(\|x - y\|^2 + c^2)^{-1/2}$

Multiquadric                     $(\|x - y\|^2 + c^2)^{1/2}$

Thin plate splines               $\|x - y\|^{2n+1}$
                                 $\|x - y\|^{2n} \ln(\|x - y\|)$

Multi layer perceptron           $\tanh(x \cdot y - \theta)$, for suitable $\theta$

Polynomial of degree d           $(1 + x \cdot y)^d$

B splines                        $B_{2n+1}(x - y)$

Trigonometric polynomials        $\sin((d + 1/2)(x - y)) / \sin((x - y)/2)$

Table 3: Examples of kernel functions

In Table 3 the last two kernels are one dimensional i.e.
for the case when X has only one column. Multivariate
versions can be derived from products of these kernel
functions. The definition of $B_{2n+1}$ can be found in De
Boor (1978). Use of a kernel function in either a
generalised linear model or a quasi likelihood model
results in mean values which are smooth (as opposed to
transforms of linear) functions of the covariates X.
Such models may give a substantially better fit to the
data.

A fourth embodiment relating to a proportional hazards 
model will now be described. 

D. Proportional Hazard Models

The method of this embodiment may utilise training
samples in order to identify a subset of components which
are capable of affecting the probability that a defined
event (e.g. death, recovery) will occur within a certain
time period. Training samples are obtained from a
system and the time measured from when the training
sample is obtained to when the event has occurred. Using
a statistical method to associate the time to the event
with the data obtained from a plurality of training
samples, a subset of components may be identified that
are capable of predicting the distribution of the time to
the event. Subsequently, knowledge of the subset of
components can be used for tests, for example clinical
tests to predict, for example, statistical features of the
time to death or time to relapse of a disease. For
example, the data from a subset of components of a system
may be obtained from a DNA microarray. This data may be
used to predict a clinically relevant event such as, for
example, expected or median patient survival times, or to
predict onset of certain symptoms, or relapse of a
disease.

In this way, the present invention identifies preferably 
a minimum number of components which can be used to 
predict the distribution of the time to an event of a 
system. The minimum number of components is 
"predictive" for that time to an event. Essentially, from 
all the data which is generated from the system, the 
method of the present invention enables identification of 
a minimum number of components which can be used to 
predict time to an event. Once those components have 
been identified by this method, the components can be 
used in future to predict statistical features of the 
time to an event of a system from new samples. The 
method of the present invention preferably utilises a 
statistical method to eliminate components that are not 
required to correctly predict the time to an event of a 
system. 

As used herein, "time to an event" refers to a measure of 
the time from obtaining the sample to which the method of 
the invention is applied to the time of an event. An 
event may be any observable event. When the system is a 
biological system, the event may be, for example, time 
till failure of a system, time till death, onset of a 
particular symptom or symptoms, onset or relapse of a 
condition or disease, change in phenotype or genotype,
change in biochemistry, change in morphology of an 
organism or tissue, change in behaviour. 

The samples are associated with a particular time to an
event from previous times to an event. The times to an
event may be times determined from data obtained from,
for example, patients in which the time from sampling to
death is known, or in other words, "genuine" survival
times, and patients in which the only information is that
the patients were alive when samples were last obtained,
or in other words, "censored" survival times indicating
that the particular patient has survived for at least a
given number of days.

In one embodiment, the input data is organised into an
$N \times p$ data matrix $X = \{x_{ij}\}$ with N test training
samples and p components. Typically, p will be much
greater than N.

For example, consider an $N \times p$ data matrix $X = \{x_{ij}\}$
from, for example, a microarray experiment, with N
individuals (or samples) and the same p genes for each
individual. Preferably, there is associated with each
individual $i$ ($i = 1, 2, \ldots, N$) a variable $y_i$ ($y_i > 0$)
denoting the time to an event, for example, survival
time. For each individual there may also be defined a
variable that indicates whether that individual's
survival time is a genuine survival time or a censored
survival time. Denote the censor indicators as $c_i$ where

    $c_i = 1$, if $y_i$ is uncensored
    $c_i = 0$, if $y_i$ is censored

The $N \times 1$ vector with survival times $y_i$ may be written
as $y$ and the $N \times 1$ vector with censor indicators $c_i$ as $c$.

Typically, as discussed above, the component weights are
estimated in a manner which takes into account the a
priori assumption that most of the component weights are
zero.

Preferably, the prior specified for the component weights
is of the form

    $P(\beta_1, \beta_2, \ldots, \beta_n) = \prod_{i=1}^{n} \int P(\beta_i \mid \tau_i)\, P(\tau_i)\, d\tau_i$    (1D)

where $\beta_1, \beta_2, \ldots, \beta_n$ are component weights,
$P(\beta_i \mid \tau_i)$ is $N(0, \tau_i^2)$ and $P(\tau_i) \propto 1/\tau_i$ is a
Jeffreys prior (Kotz and Johnson, 1983).

The likelihood function defines a model which fits the
data based on the distribution of the data. Preferably,
the likelihood function is of the form:

    Log (Partial) Likelihood $= \sum_{i=1}^{N} g_i(\beta, \varphi; X, y, c)$    (2D)

where $\beta^T = (\beta_1, \beta_2, \ldots, \beta_p)$ and
$\varphi^T = (\varphi_1, \varphi_2, \ldots, \varphi_q)$ are the model parameters. The
model defined by the likelihood function may be any model
for predicting the time to an event of a system.

In one embodiment, the model defined by the likelihood is
Cox's proportional hazards model. Cox's proportional
hazards model was introduced by Cox (1972) and may
preferably be used as a regression model for survival
data. In Cox's proportional hazards model, $\beta^T$ is a
vector of (explanatory) parameters associated with the
components. Preferably, the method of the present
invention provides for the parsimonious selection (and
estimation) from the parameters
$\beta^T = (\beta_1, \beta_2, \ldots, \beta_p)$ for Cox's proportional hazards
model given the data X, y and c.

Application of Cox's proportional hazards model can be
problematic in the circumstance where different data is
obtained from a system for the same survival times, or in
other words, for cases where tied survival times occur.
Tied survival times may be subjected to a pre-processing
step that leads to unique survival times. The
pre-processing proposed simplifies the ensuing algorithm
as it avoids concerns about tied survival times in the
subsequent application of Cox's proportional hazards
model.

The pre-processing of the survival times is applied by
adding an extremely small amount of insignificant random
noise. Preferably, the procedure is to take sets of tied 
times and add to each tied time within a set of tied 
times a random amount that is drawn from a normal 
distribution that has zero mean and variance proportional 
to the smallest non-zero distance between sorted survival 
times. Such pre-processing achieves an elimination of 
tied times without imposing a draconian perturbation of 
the survival times. 
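
The tie-breaking step can be sketched as follows. The proportionality constant `scale`, the fixed `seed` and the function name are assumptions made for the example; the text only states that the noise variance is proportional to the smallest non-zero distance between sorted survival times.

```python
import random
from collections import Counter

def break_ties(times, scale=1e-6, seed=0):
    """Add small zero-mean normal noise to tied times; leave untied times as-is."""
    rng = random.Random(seed)
    ordered = sorted(times)
    gaps = [b - a for a, b in zip(ordered, ordered[1:]) if b > a]
    if not gaps:
        raise ValueError("all times are equal; no non-zero gap to scale the noise")
    # Variance proportional to the smallest non-zero gap between sorted times.
    sd = (scale * min(gaps)) ** 0.5
    counts = Counter(times)
    return [t + rng.gauss(0.0, sd) if counts[t] > 1 else t for t in times]

times = [5.0, 3.0, 5.0, 8.0, 3.0, 10.0]  # made-up survival times with two ties
distinct = break_ties(times)
```

Because the noise standard deviation is far smaller than the smallest gap, the relative ordering of times that were not tied is not disturbed in practice.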

The pre-processing generates distinct survival
times. Preferably, these times may be ordered in
increasing magnitude denoted as
$t = (t_{(1)}, t_{(2)}, \ldots, t_{(N)})$, $t_{(i+1)} > t_{(i)}$.

Denote by Z the $N \times p$ matrix that is the re-arrangement
of the rows of X where the ordering of the rows of
Z corresponds to the ordering induced by the ordering of
$t$; also denote by $Z_j$ the j-th row of the matrix Z. Let d
be the result of ordering c with the same permutation
required to order $t$.

After pre-processing for tied survival times is
taken into account and reference is made to standard
texts on survival data analysis (e.g. Cox and Oakes, 1984),
the likelihood function for the proportional hazards
model may preferably be written as

    $L(t \mid \beta) = \prod_{j=1}^{N} \left( \frac{\exp(Z_j \beta)}{\sum_{i \in \Re_j} \exp(Z_i \beta)} \right)^{d_j}$    (3D)

where $\beta^T = (\beta_1, \beta_2, \ldots, \beta_n)$, $Z_j$ = the j-th row of Z, and
$\Re_j = \{i : i = j, j+1, \ldots, N\}$ = the risk set at the j-th
ordered event time $t_{(j)}$.

The logarithm of the likelihood (i.e. $l = \log(L)$) may
preferably be written as

    $l(t \mid \beta) = \sum_{i=1}^{N} d_i \left( Z_i \beta - \log \sum_{j \in \Re_i} \exp(Z_j \beta) \right)$
    $= \sum_{i=1}^{N} d_i \left( Z_i \beta - \log \sum_{j=1}^{N} \zeta_{i,j} \exp(Z_j \beta) \right)$    (4D)

where

    $\zeta_{i,j} = 0$, if $j < i$
    $\zeta_{i,j} = 1$, if $j \ge i$

Notice that the model is non-parametric in that the
parametric form of the survival distribution is not
specified - preferably only the ordinal property of the
survival times is used (in the determination of the risk
sets). As this is a non-parametric case $\varphi$ is not
required (i.e. q = 0).
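
The log partial likelihood can be evaluated with a single backward pass over the time-ordered rows, since the risk set for row i is simply rows i to N. The sketch below is illustrative only: the function name, the list-of-rows layout and the toy data are assumptions, and no safeguard against overflow of the exponential is included.

```python
import math

def cox_partial_loglik(Z, d, beta):
    """Log partial likelihood for rows of Z sorted by increasing event time.

    Z: N rows of p covariates; d: censor indicators in the same order
    (1 = uncensored); beta: p component weights.
    """
    eta = [sum(z * b for z, b in zip(row, beta)) for row in Z]
    risk_sum = 0.0
    total = 0.0
    # Walk backwards so risk_sum holds the sum of exp(eta_j) over j >= i.
    for i in range(len(Z) - 1, -1, -1):
        risk_sum += math.exp(eta[i])
        total += d[i] * (eta[i] - math.log(risk_sum))
    return total

Z = [[0.2], [1.0], [-0.5]]  # made-up covariates, already ordered by time
ll_null = cox_partial_loglik(Z, d=[1, 1, 1], beta=[0.0])
```

With all weights equal to zero and no censoring, every ordering of the N events is equally likely, so the value is minus the logarithm of N factorial.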



In another embodiment of the method of the
invention, the model defined by the likelihood function
is a parametric survival model. Preferably, in a
parametric survival model, $\beta^T$ is a vector of
(explanatory) parameters associated with the components,
and $\varphi^T$ is a vector of parameters associated with the
functional form of the survival density function.

Preferably, the method of the invention provides
for the parsimonious selection (and estimation) from the
parameters $\beta^T$ and the estimation of
$\varphi^T = (\varphi_1, \varphi_2, \ldots, \varphi_q)$ for parametric survival models
given the data X, y and c.

In applying a parametric survival model, the survival
times do not require pre-processing and are denoted as y.

The parametric survival model is applied as follows:

Denote by $f(y; \varphi, \beta, X)$ the parametric density function
of the survival time, denote its survival function by

    $S(y; \varphi, \beta, X) = \int_y^{\infty} f(u; \varphi, \beta, X)\, du$

where $\varphi$ are the parameters relevant to the parametric
form of the density function and $\beta$, X are as defined
above. The hazard function is defined as
$h(y_i; \varphi, \beta, X) = f(y_i; \varphi, \beta, X) / S(y_i; \varphi, \beta, X)$.

Preferably, the generic formulation of the log-likelihood
function, taking censored data into account, is

    $l = \sum_{i=1}^{N} \left\{ c_i \log(f(y_i; \varphi, \beta, X)) + (1 - c_i) \log(S(y_i; \varphi, \beta, X)) \right\}$

Reference to standard texts on analysis of survival
time data via parametric regression survival models
reveals a collection of survival time distributions that
may be used. Survival distributions that may be used
include, for example, the Weibull, Exponential or Extreme
Value distributions.

If the hazard function may be written as

    $h(y_i; \varphi, \beta, X) = \lambda(y_i; \varphi) \exp(X_i \beta)$

then $S(y_i; \varphi, \beta, X) = \exp(-\Lambda(y_i; \varphi)\, e^{X_i \beta})$
and $f(y_i; \varphi, \beta, X) = \lambda(y_i; \varphi) \exp(X_i \beta - \Lambda(y_i; \varphi)\, e^{X_i \beta})$ where
$\Lambda(y_i; \varphi) = \int_{-\infty}^{y_i} \lambda(u; \varphi)\, du$ is the integrated hazard
function, $\lambda(y_i; \varphi) = \frac{\partial \Lambda(y_i; \varphi)}{\partial y_i}$, and $X_i$ is the i-th row
of X.

The Weibull, Exponential and Extreme Value distributions
have density and hazard functions that may be written in
the form of those presented in the paragraph immediately
above.

The application detailed relies in part on an
algorithm of Aitkin and Clayton (1980); however, it permits
the user to specify any parametric underlying hazard
function.

Following from Aitkin and Clayton (1980) a preferred
likelihood function which models a parametric survival
model is:

    $l = \sum_{i=1}^{N} \left\{ c_i \log(\mu_i) - \mu_i + c_i \log\!\left( \frac{\lambda(y_i)}{\Lambda(y_i; \varphi)} \right) \right\}$    (5D)

where $\mu_i = \Lambda(y_i; \varphi) \exp(X_i \beta)$. Aitkin and Clayton (1980)
note that a consequence of equation (5D) is that the
$c_i$'s may be treated as Poisson variates with means $\mu_i$
and that the last term in equation (5D) does not depend
on $\beta$ (although it depends on $\varphi$).
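
The equivalence between the generic censored-data log likelihood and the Poisson-like form can be checked numerically. The sketch below assumes, purely for illustration, a Weibull baseline with hazard $\lambda(y) = \alpha y^{\alpha - 1}$ and integrated hazard $\Lambda(y) = y^{\alpha}$ (so $\varphi$ is the single shape parameter $\alpha$); the function names and toy data are hypothetical.

```python
import math

def loglik_generic(y, c, xb, alpha):
    """sum of c*log f + (1 - c)*log S with hazard lambda(y) * exp(xb)."""
    total = 0.0
    for yi, ci, eta in zip(y, c, xb):
        lam = alpha * yi ** (alpha - 1)      # baseline hazard
        cum = yi ** alpha                    # integrated baseline hazard
        log_s = -cum * math.exp(eta)         # log survival function
        log_f = math.log(lam) + eta + log_s  # density = hazard * survival
        total += ci * log_f + (1 - ci) * log_s
    return total

def loglik_poisson_form(y, c, xb, alpha):
    """sum of c*log(mu) - mu + c*log(lambda / Lambda), mu = Lambda * exp(xb)."""
    total = 0.0
    for yi, ci, eta in zip(y, c, xb):
        lam = alpha * yi ** (alpha - 1)
        cum = yi ** alpha
        mu = cum * math.exp(eta)
        total += ci * math.log(mu) - mu + ci * math.log(lam / cum)
    return total

y = [0.5, 1.2, 3.0, 2.2]     # made-up survival times
c = [1, 0, 1, 1]             # 1 = uncensored, 0 = censored
xb = [0.3, -0.1, 0.8, 0.0]   # made-up linear predictors
gap = abs(loglik_generic(y, c, xb, 1.5) - loglik_poisson_form(y, c, xb, 1.5))
```

The two forms agree up to floating point rounding, which is what allows the censor indicators to be handled with Poisson regression machinery.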

Preferably, the posterior distribution of $\beta$, $\varphi$ and $\tau$
given y is

    $P(\beta, \varphi, \tau \mid y) \propto L(y \mid \beta, \varphi)\, P(\beta \mid \tau)\, P(\tau)$    (6D)

wherein $L(y \mid \beta, \varphi)$ is the likelihood function.

In one embodiment, $\tau$ may be treated as a vector of
missing data and an iterative procedure used to maximise
equation (6D) to produce a posteriori estimates of $\beta$.

The prior of equation (1D) is such that the maximum a
posteriori estimates will tend to be sparse i.e. if a
large number of parameters are redundant, many components
of $\beta$ will be zero.

Because a prior expectation exists that many components
of $\beta^T$ are zero, the estimation may be performed in such a
way that most of the estimated $\beta_i$'s are zero and the
remaining non-zero estimates provide an adequate
explanation of the survival times.

In the context of microarray data this exercise
translates to identifying a parsimonious set of genes
that provide an adequate explanation for the event times.

As stated above, the component weights which maximise the
posterior distribution may be determined using an
iterative procedure. Preferably, the iterative procedure
for maximising the posterior distribution of the
components and component weights is an EM algorithm, such
as, for example, that described in Dempster et al, 1977.

In one embodiment, the EM algorithm comprises the steps:




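1. Initialising the algorithm by setting n = 0,
$S_0 = \{1, 2, \ldots, p\}$, initialise $\beta^{(0)} = \beta^*$, $\varphi^{(0)}$,

2. Defining

    $\beta_i^{(n)} = \beta_i^*$ if $i \in S_n$, and 0 otherwise

and let $P_n$ be a matrix of zeroes and ones such that
the nonzero elements $\gamma^{(n)}$ of $\beta^{(n)}$ satisfy

    $\gamma^{(n)} = P_n^T \beta^{(n)}$, $\beta^{(n)} = P_n \gamma^{(n)}$
    $\gamma = P_n^T \beta$, $\beta = P_n \gamma$    (7D)

3. Performing an estimation step by calculating the
expected value of the posterior distribution of component
weights. This may be performed using the function:

    $Q(\beta \mid \beta^{(n)}, \varphi^{(n)}) = E\{\log(P(\beta, \varphi, \tau \mid y)) \mid y, \beta^{(n)}, \varphi^{(n)}\}$
    $= l(y \mid \beta, \varphi^{(n)}) - \frac{1}{2} \sum_{i \in S_n} \left( \frac{\beta_i}{\beta_i^{(n)}} \right)^2$    (8D)

where $l$ is the log likelihood function of y. Using
$\beta = P_n \gamma$ and $\beta^{(n)} = P_n \gamma^{(n)}$ we have

    $Q(\gamma \mid \gamma^{(n)}, \varphi^{(n)}) = l(y \mid P_n \gamma, \varphi^{(n)}) - \frac{1}{2} \sum_{i} \left( \frac{\gamma_i}{\gamma_i^{(n)}} \right)^2$    (9D)

4. Performing the maximisation step. This may be
performed using Newton Raphson iterations as
follows:

Set $\gamma_0 = \gamma^{(n)}$ and for r = 0, 1, 2, ...
$\gamma_{r+1} = \gamma_r + \alpha_r \delta_r$ where $\alpha_r$ is chosen by a line search
algorithm to ensure
$Q(\gamma_{r+1} \mid \gamma^{(n)}, \varphi^{(n)}) > Q(\gamma_r \mid \gamma^{(n)}, \varphi^{(n)})$,
and

    $\delta_r = \mathrm{diag}(\gamma^{(n)}) \left[ -\mathrm{diag}(\gamma^{(n)}) \frac{\partial^2 l}{\partial \gamma_r^2} \mathrm{diag}(\gamma^{(n)}) + I \right]^{-1} \left( \mathrm{diag}(\gamma^{(n)}) \frac{\partial l}{\partial \gamma_r} - \frac{\gamma_r}{\gamma^{(n)}} \right)$

where

    $\frac{\partial l}{\partial \gamma_r} = P_n^T \frac{\partial l}{\partial \beta_r}$, $\frac{\partial^2 l}{\partial \gamma_r^2} = P_n^T \frac{\partial^2 l}{\partial \beta_r^2} P_n$ for $\beta_r = P_n \gamma_r$    (10D)

Let $\gamma^*$ be the value of $\gamma_r$ when some convergence
criterion is satisfied e.g. $\|\gamma_r - \gamma_{r+1}\| < \varepsilon$ (for example
$\varepsilon = 10^{-5}$).

5. Define $\beta^* = P_n \gamma^*$,
$S_{n+1} = \{i : |\beta_i^*| > \varepsilon_1 \max_j |\beta_j^*|\}$ where $\varepsilon_1$ is a
small constant, say $10^{-5}$. Set n = n+1, choose
$\varphi^{(n+1)} = \varphi^{(n)} + \kappa_n (\varphi^* - \varphi^{(n)})$ where $\varphi^*$ satisfies
$\frac{\partial l(y \mid P_n \gamma^*, \varphi)}{\partial \varphi} = 0$
and $\kappa_n$ is a damping factor such that $0 < \kappa_n < 1$.

6. Check convergence. If $\|\gamma^* - \gamma^{(n)}\| < \varepsilon_2$ where $\varepsilon_2$ is
suitably small then stop, else go to step 2 above.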

In another embodiment, step (4) in the maximisation step
may be estimated by replacing $\frac{\partial^2 l}{\partial \gamma_r^2}$ with its
expectation $E\{\frac{\partial^2 l}{\partial \gamma_r^2}\}$.

In one embodiment, the EM algorithm is applied to 
maximise the posterior distribution when the model is 
Cox's proportional hazards model.

To aid in the exposition of the application of the EM
algorithm when the model is Cox's proportional hazards
model, it is preferred to define "dynamic weights" and
matrices based on these weights. The weights are

    $w_{i,l} = \frac{\zeta_{i,l} \exp(Z_l \beta)}{\sum_{j=1}^{N} \zeta_{i,j} \exp(Z_j \beta)}$

    $w_l^* = \sum_{i=1}^{N} d_i\, w_{i,l}$

    $\tilde{w}_l = d_l - w_l^*$

Matrices based on these weights are

    $W_i = (w_{i,1}, w_{i,2}, \ldots, w_{i,N})^T$

    $\tilde{W} = (\tilde{w}_1, \tilde{w}_2, \ldots, \tilde{w}_N)^T$

    $\Delta(W^*) = \mathrm{diag}(w_1^*, w_2^*, \ldots, w_N^*)$

    $W^{**} = \sum_{i=1}^{N} d_i\, W_i W_i^T$

In terms of the matrices of weights the first and
second derivatives of $l$ may be written as

    $\frac{\partial l}{\partial \beta} = Z^T \tilde{W}$

    $\frac{\partial^2 l}{\partial \beta^2} = Z^T \left[ W^{**} - \Delta(W^*) \right] Z = Z^T K Z$    (11D)

where $K = W^{**} - \Delta(W^*)$. Note therefore from the
transformation matrix $P_n$ described as part of Step (2) of
the EM algorithm (Equation 7D) (see also Equations (10D))
it follows that

    $\frac{\partial l}{\partial \gamma_r} = P_n^T \frac{\partial l}{\partial \beta_r}$

    $\frac{\partial^2 l}{\partial \gamma_r^2} = P_n^T \frac{\partial^2 l}{\partial \beta_r^2} P_n$    (12D)

Preferably, when the model is Cox's proportional hazards 
model the E step and M step of the EM algorithm are as 
follows: 

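1. Set n = 0, $S_0 = \{1, 2, \ldots, p\}$. Let $v$ be the vector
with components

    $v_i = 1 - \varepsilon$, if $c_i = 1$
    $v_i = \varepsilon$, if $c_i = 0$

for some small $\varepsilon$, say .001. Define f to be $\log(v/t)$.
If $p \le N$ compute initial values $\beta^*$ by

    $\beta^* = (Z^T Z + \lambda I)^{-1} Z^T f$

If $p > N$ compute initial values $\beta^*$ by

    $\beta^* = \frac{1}{\lambda} \left( I - Z^T (Z Z^T + \lambda I)^{-1} Z \right) Z^T f$

where the ridge parameter $\lambda$ satisfies $0 < \lambda \le 1$.

2. Define

    $\beta_i^{(n)} = \beta_i^*$ if $i \in S_n$, and 0 otherwise

Let $P_n$ be a matrix of zeroes and ones such that the
nonzero elements $\gamma^{(n)}$ of $\beta^{(n)}$ satisfy

    $\gamma^{(n)} = P_n^T \beta^{(n)}$, $\beta^{(n)} = P_n \gamma^{(n)}$
    $\gamma = P_n^T \beta$, $\beta = P_n \gamma$

3. Perform the E step by calculating

    $Q(\beta \mid \beta^{(n)}) = E\{\log(P(\beta, \varphi, \tau \mid t)) \mid t, \beta^{(n)}\}$
    $= l(t \mid \beta) - \frac{1}{2} \sum_{i \in S_n} \left( \frac{\beta_i}{\beta_i^{(n)}} \right)^2$

where $l$ is the log likelihood function of t given by
Equation (8D). Using $\beta = P_n \gamma$ and $\beta^{(n)} = P_n \gamma^{(n)}$ we have

    $Q(\gamma \mid \gamma^{(n)}) = l(t \mid P_n \gamma) - \frac{1}{2} \sum_{i} \left( \frac{\gamma_i}{\gamma_i^{(n)}} \right)^2$

4. Do the M step. This can be done with Newton Raphson
iterations as follows. Set $\gamma_0 = \gamma^{(n)}$ and for r = 0, 1, 2, ...
$\gamma_{r+1} = \gamma_r + \alpha_r \delta_r$ where $\alpha_r$ is chosen by a line search
algorithm to ensure
$Q(\gamma_{r+1} \mid \gamma^{(n)}, \varphi^{(n)}) > Q(\gamma_r \mid \gamma^{(n)}, \varphi^{(n)})$.
For $p \le N$ use

    $\delta_r = \mathrm{diag}(\gamma^{(n)}) \left( -Y^T K Y + I \right)^{-1} \left( Y^T \tilde{W} - \mathrm{diag}(1/\gamma^{(n)})\, \gamma_r \right)$

where $Y = Z P_n\, \mathrm{diag}(\gamma^{(n)})$.
For $p > N$ use

Let $\gamma^*$ be the value of $\gamma_r$ when some convergence
criterion is satisfied e.g. $\|\gamma_r - \gamma_{r+1}\| < \varepsilon$ (for example
$10^{-5}$).

5. Define $\beta^* = P_n \gamma^*$,
$S_{n+1} = \{i : |\beta_i^*| > \varepsilon_1 \max_j |\beta_j^*|\}$ where $\varepsilon_1$ is a
small constant. This step eliminates variables
with very small coefficients.

6. Check convergence. If $\|\gamma^* - \gamma^{(n)}\| < \varepsilon_2$ where $\varepsilon_2$ is
suitably small then stop, else set n = n+1, go to step 2
above and repeat procedure until convergence occurs.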
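
Steps 5 and 6 amount to a relative-magnitude threshold followed by a norm test, which can be sketched as below. The function names, the default tolerances and the example weights are assumptions for illustration, and indices are 0-based here whereas the text numbers components from 1.

```python
def active_set(beta_star, eps1=1e-5):
    """Indices i with |beta_i*| > eps1 * max_j |beta_j*|."""
    largest = max(abs(b) for b in beta_star)
    return [i for i, b in enumerate(beta_star) if abs(b) > eps1 * largest]

def converged(gamma_star, gamma_n, eps2=1e-8):
    """Euclidean norm test ||gamma* - gamma^(n)|| < eps2."""
    dist = sum((a - b) ** 2 for a, b in zip(gamma_star, gamma_n)) ** 0.5
    return dist < eps2

beta_star = [2.0, 3e-7, -0.4, 0.0, 1e-3]  # made-up weights after an M step
S_next = active_set(beta_star)            # components kept for the next pass
```

Because the threshold is relative to the largest weight, rescaling all weights by a common factor leaves the retained set unchanged.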

In another embodiment the EM algorithm is applied to
maximise the posterior distribution when the model is a
parametric survival model.

In applying the EM algorithm to the parametric survival
model, a consequence of equation (5D) is that the $c_i$'s may
be treated as Poisson variates with means $\mu_i$ and that the
last term in equation (5D) does not depend on $\beta$ (although
it depends on $\varphi$).

Note that $\log(\mu_i) = \log(\Lambda(y_i; \varphi)) + X_i \beta$ and so it is
possible to couch the problem in terms of a log-linear
model for the Poisson-like mean. Preferably, an iterative
maximization of the log-likelihood function is performed
where given initial estimates of $\varphi$ the estimates of $\beta$
are obtained. Then given these estimates of $\beta$, updated
estimates of $\varphi$ are obtained. The procedure is continued
until convergence occurs.

Applying the posterior distribution described above, 
we note that (for fixed <p) 

30 M dfi^dfi M d/3 d j3 ~ Xi U3D) 

Consequently from Equations (11D) and (12D) it follows
that

$$\frac{\partial l}{\partial \beta} = X^T (c - \mu) \quad \text{and} \quad \frac{\partial^2 l}{\partial \beta^2} = -X^T \operatorname{diag}(\mu)\, X.$$



The versions of Equation (12D) relevant to the parametric 
survival models are 



(14D) 



To solve for $\varphi$ after each M step of the EM algorithm
(see step 5 below) preferably put

$$\varphi^{(n+1)} = \varphi^{(n)} + \kappa_n \left(\varphi^* - \varphi^{(n)}\right)$$

where $\varphi^*$ satisfies $\frac{\partial l}{\partial \varphi} = 0$ for
$0 < \kappa_n < 1$ and $\beta$ is fixed at the value obtained from the
previous M step.

It is possible to provide an EM algorithm for
parameter selection in the context of parametric survival
models and microarray data. Preferably, the EM algorithm
is as follows:

1. Set n=0, $S_0 = \{1, 2, \ldots, p\}$, $\varphi^{(initial)} = \varphi^{(0)}$. Let v be the
vector with components

$v_i = \varepsilon$, if $c_i = 0$
for some small $\varepsilon$, say for example 0.001. Define f to be
$\log(v/\Lambda(y; \varphi))$.



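If $p \le N$ compute initial values $\beta^*$ by

$$\beta^* = (X^T X + \lambda I)^{-1} X^T f$$

If $p > N$ compute initial values $\beta^*$ by

$$\beta^* = \frac{1}{\lambda}\left(I - X^T (X X^T + \lambda I)^{-1} X\right) X^T f$$

where the ridge parameter $\lambda$ satisfies $0 < \lambda \le 1$.
2. Define

$$\beta_i^{(n)} = \begin{cases} \beta_i^*, & i \in S_n \\ 0, & \text{otherwise} \end{cases}$$

Let $P_n$ be a matrix of zeroes and ones such that the
nonzero elements $\gamma^{(n)}$ of $\beta^{(n)}$ satisfy

$$\gamma^{(n)} = P_n^T \beta^{(n)}, \quad \beta^{(n)} = P_n \gamma^{(n)}$$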

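3. Perform the E step by calculating

$$Q(\beta \mid \beta^{(n)}, \varphi^{(n)}) = E\{\log p(\beta, \varphi, \tau \mid y) \mid y, \beta^{(n)}, \varphi^{(n)}\} = l(y \mid \beta, \varphi^{(n)}) - \frac{1}{2}\sum_i \left(\frac{\beta_i}{\beta_i^{(n)}}\right)^2$$

where l is the log likelihood function of y and $\varphi^{(n)}$.
Using $\beta = P_n\gamma$ and $\beta^{(n)} = P_n\gamma^{(n)}$ we have

$$Q(\gamma \mid \gamma^{(n)}, \varphi^{(n)}) = l(y \mid P_n\gamma, \varphi^{(n)}) - \frac{1}{2}\sum_i \left(\frac{\gamma_i}{\gamma_i^{(n)}}\right)^2$$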



4. Do the M step. This can be done with Newton Raphson
iterations as follows. Set $\gamma_0 = \gamma^{(n)}$ and for r = 0, 1, 2, ...
$\gamma_{r+1} = \gamma_r + \alpha_r \delta_r$ where $\alpha_r$ is chosen by a line search
algorithm to ensure $Q(\gamma_{r+1} \mid \gamma^{(n)}, \varphi^{(n)}) > Q(\gamma_r \mid \gamma^{(n)}, \varphi^{(n)})$.

For $p \le N$ use
£ = -diag(/ n) )[y;diag(^ 
where $Y = X P_n \operatorname{diag}(\gamma^{(n)})$.

For $p > N$ use

Let $\gamma^*$ be the value of $\gamma_r$ when some convergence
criterion is satisfied, e.g. $\lVert \gamma_r - \gamma_{r+1} \rVert < \varepsilon$ (for example
$10^{-5}$).

5. Define $\beta^* = P_n \gamma^*$, $S_n = \{ i : |\beta_i^*| > \varepsilon_1 \max_j |\beta_j^*| \}$ where $\varepsilon_1$ is a
small constant, say $10^{-5}$. Set n=n+1, choose

$$\varphi^{(n)} = \varphi^{(n-1)} + \kappa_n \left(\varphi^* - \varphi^{(n-1)}\right)$$

where $\varphi^*$ satisfies $\frac{\partial l(y \mid P_n\gamma^*, \varphi)}{\partial \varphi} = 0$
and $\kappa_n$ is a damping factor such that $0 < \kappa_n < 1$.

6. Check convergence. If $\lVert \gamma^* - \gamma^{(n)} \rVert < \varepsilon_2$ where $\varepsilon_2$ is suitably
small then stop, else go to step 2.

In another embodiment, survival times are described
by a Weibull survival density function. For the Weibull
case $\varphi$ is preferably one dimensional and

$$\Lambda(y; \varphi) = y^{\alpha}, \quad \lambda(y; \varphi) = \alpha y^{\alpha - 1}, \quad \varphi = \alpha$$

Preferably,

$$\frac{\partial l}{\partial \alpha} = \frac{N}{\alpha} + \sum_{i=1}^{N} (c_i - \mu_i) \log(y_i) = 0$$

is solved after each M step so as to provide an updated
value of $\alpha$.
Following the steps applied for Cox's proportional
hazards model, one may estimate $\alpha$ and select a
parsimonious subset of parameters from $\beta$ that can
provide an adequate explanation for the survival times if
the survival times follow a Weibull distribution.

Features and advantages of the present invention will 
become apparent following a description of examples.

EXAMPLES

EXAMPLE 1: Two group Classification for Prostate Cancer 
using a Logistic regression model 

In order to identify subsets of genes capable of
classifying tissue into prostate or non-prostate groups,
the microarray data set reported and analysed by Luo et
al. (2001) was subjected to analysis using the method of
the invention in which a binomial logistic regression was
used as the model. This data set involves microarray data
on 6500 human genes. The study contains 16 subjects
known to have prostate cancer and 9 subjects with benign
prostatic hyperplasia. However, for brevity of
presentation only, 50 genes were selected for analysis.
The gene expression ratios for all 50 genes (rows) and 25
patients (columns) are shown in Table 4.

The results of applying the method are given below. The
model had G=2 classes and commenced with all 50 genes as
potential variables (components or basis functions) in
the model. After 21 iterations (see below) the
algorithm found 2 genes (numbers 36 and 47 of Table 4)
which gave perfect classification.

To determine whether the result was an artefact due to
the large number of genes (variables) available in the
data set, we ran a permutation test whereby the class
labels were randomly permuted and the algorithm
subsequently applied. This was repeated 200 times.
Figure 1 gives a histogram of the number of cases
correctly classified. The 100% accuracy for the actual
data set is in the extreme tail of the permutation
distribution with a p value of 0.015. This suggests the
results are not due to chance.
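
The permutation test described above can be sketched as follows.
This is a minimal illustration and not the code used to produce
Figure 1. The function names `permutation_p_value` and
`best_threshold_accuracy`, the toy data, and the single-threshold
classifier that stands in for the variable selection algorithm are
all assumptions made for the example.

```python
import random

def best_threshold_accuracy(xs, ys):
    """Best training accuracy of the rule 'predict 1 when x > t' over cut points t."""
    best = 0.0
    for t in [min(xs) - 1.0] + sorted(xs):
        correct = sum((x > t) == bool(y) for x, y in zip(xs, ys))
        best = max(best, correct / len(xs))
    return best

def permutation_p_value(features, labels, fit_and_score, n_perm=200, seed=0):
    """Share of label permutations that score at least as well as the true labels."""
    rng = random.Random(seed)
    observed = fit_and_score(features, labels)
    hits = 0
    for _ in range(n_perm):
        shuffled = labels[:]          # copy, then permute the class labels
        rng.shuffle(shuffled)
        if fit_and_score(features, shuffled) >= observed:
            hits += 1
    return observed, hits / n_perm

# Toy stand-in for one expression ratio measured on 10 samples, 5 per class.
xs = [0.15, 0.16, 0.10, 0.22, 0.12, 0.65, 0.73, 0.71, 0.45, 0.59]
ys = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
observed, p_value = permutation_p_value(xs, ys, best_threshold_accuracy)
print(observed, p_value)
```

The p value is simply the fraction of permuted data sets on which
the procedure classifies at least as well as on the real labels, so
with 200 permutations a value of 0.015 corresponds to 3 such cases.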

The iteration details for the unpermuted data are shown
below:

Iteration 1 : 13 cycles, criterion -0.127695594930065
misclassification matrix
      1   2
  1  16   0
  2   0   9
row = true class
Class 1 Number of basis functions in model : 50
***********************************************
Iteration 2 : 7 cycles, criterion -1.58111247310685
misclassification matrix
      1   2
  1  16   0
  2   0   9
row = true class
Class 1 Number of basis functions in model : 50
***********************************************
Iteration 3 : 5 cycles, criterion -2.82347006228686
misclassification matrix
      1   2
  1  16   0
  2   0   9
row = true class
Class 1 Number of basis functions in model : 45
***********************************************
Iteration 4 : 4 cycles, criterion -3.0353135992828
misclassification matrix
      1   2
  1  16   0
  2   0   9
row = true class
Class 1 : Variables left in model
2 3 8 9 11 12 17 19 23 25 29 31 36 40 42 45 47 48 49
regression coefficients
-0.00111392924172276 -3.66542218865611e-007 -1.18280157375022e-010 -1.15853525792239e-008
-1.23611388510839e-010 -1.99942263084115e-008 -0.00035412991046087 -0.844161298425504
-7.02985067116106e-011 7.92510183180024e-011 -0.000286751487965763 -8.12273456244463e-008
-4.57102500405226 -0.000474781601043787 2.81670912477482e-011 -1.02591823605395e-008
1.20451375402485 -0.0120825667151016 -0.000171130745325351
***********************************************
Iteration 5 : 4 cycles, criterion -2.82549351870821
misclassification matrix
      1   2
  1  16   0
  2   0   9
row = true class
Class 1 : Variables left in model
2 17 19 29 36 40 47 48 49
regression coefficients
-1.01527560660479e-006 -6.47965734465826e-008 -0.36354429595162 -2.96434390382785e-008
-5.84197907608526 -8.39936030488418e-008 1.22712881145334 -0.00041963284443207
-5.78172364089109e-008
***********************************************
Iteration 6 : 4 cycles, criterion -2.49714605824366
misclassification matrix
      1   2
  1  16   0
  2   0   9
row = true class
Class 1 : Variables left in model
19 36 47 48
regression coefficients
-0.0598894592370422 -6.95130027598687 1.31485208225331 -4.34828258633208e-007
***********************************************
Iteration 7 : 4 cycles, criterion -2.20181629904024
misclassification matrix
      1   2
  1  16   0
  2   0   9
row = true class
Class 1 : Variables left in model
19 36 47
regression coefficients
-0.00136540505944133 -7.61400108609408 1.40720739106609
***********************************************
Iteration 8 : 3 cycles, criterion -2.02147819230974
misclassification matrix
      1   2
  1  16   0
  2   0   9
row = true class
Class 1 : Variables left in model
19 36 47
regression coefficients
-6.3429997893986e-007 -7.9815460139979 1.47084153596716
***********************************************
Iteration 9 : 3 cycles, criterion -1.92333435556147
misclassification matrix
      1   2
  1  16   0
  2   0   9
row = true class
Class 1 : Variables left in model
36 47
regression coefficients
-8.19142602569327 1.50856426381189
***********************************************
Iteration 10 : 3 cycles, criterion -1.86996621406647
misclassification matrix
      1   2
  1  16   0
  2   0   9
row = true class
Class 1 : Variables left in model
36 47
regression coefficients
-8.30998234780385 1.52999314044398
***********************************************
Iteration 11 : 3 cycles, criterion -1.84085525990757
misclassification matrix
      1   2
  1  16   0
  2   0   9
row = true class
Class 1 : Variables left in model
36 47
regression coefficients
-8.37612256703144 1.54195991212442
***********************************************
Iteration 12 : 3 cycles, criterion -1.82494385332917
misclassification matrix
      1   2
  1  16   0
  2   0   9
row = true class
Class 1 : Variables left in model
36 47
regression coefficients
-8.41273310098038 1.54858564046418
***********************************************
Iteration 13 : 2 cycles, criterion -1.81623665404495
misclassification matrix
      1   2
  1  16   0
  2   0   9
row = true class
Class 1 : Variables left in model
36 47
regression coefficients
-8.43290814197901 1.55223728701224
***********************************************
Iteration 14 : 2 cycles, criterion -1.81146858213434
misclassification matrix
      1   2
  1  16   0
  2   0   9
row = true class
Class 1 : Variables left in model
36 47
regression coefficients
-8.44399866057439 1.5542447583578
***********************************************
Iteration 15 : 2 cycles, criterion -1.80885659137866
misclassification matrix
      1   2
  1  16   0
  2   0   9
row = true class
Class 1 : Variables left in model
36 47
regression coefficients
-8.45008701361215 1.55534682956666
***********************************************
Iteration 16 : 2 cycles, criterion -1.80742542023794
misclassification matrix
      1   2
  1  16   0
  2   0   9
row = true class
Class 1 : Variables left in model
36 47
regression coefficients
-8.45342684192637 1.55595139130677
***********************************************
Iteration 17 : 2 cycles, criterion -1.80664115725287
misclassification matrix
      1   2
  1  16   0
  2   0   9
row = true class
Class 1 : Variables left in model
36 47
regression coefficients
-8.45525819006111 1.55628289706596
***********************************************
Iteration 18 : 2 cycles, criterion -1.80621136412041
misclassification matrix
      1   2
  1  16   0
  2   0   9
row = true class
Class 1 : Variables left in model
36 47
regression coefficients
-8.45626215911343 1.55646463370405
***********************************************
Iteration 19 : 2 cycles, criterion -1.80597581993879
misclassification matrix
      1   2
  1  16   0
  2   0   9
row = true class
Class 1 : Variables left in model
36 47
regression coefficients
-8.45681248047617 1.55656425211947
***********************************************
Iteration 20 : 2 cycles, criterion -1.80584672964066
misclassification matrix
      1   2
  1  16   0
  2   0   9
row = true class
Class 1 : Variables left in model
36 47
regression coefficients
-8.45711411647011 1.55661885392712
***********************************************
Iteration 21 : 2 cycles, criterion -1.80577598079056
misclassification matrix
      1   2
  1  16   0
  2   0   9
row = true class
Class 1 : Variables left in model
36 47
regression coefficients
-8.45727943971564 1.5566487805773



Table 4

Disease State  PC     PC     PC     PC     PC     PC     PC     PC
Gene 1         0.84   0.77   1.08   0.89   0.54   0.78   0.81   1.1
Gene 2         0.93   0.92   0.67   1.05   0.62   0.47   0.57   0.46
Gene 3         0.25   0.24   0.6    0.94   0.9    0.59   1.05   1.37
Gene 4         1.02   0.86   0.76   1.11   1.12   0.86   0.83   1.6
Gene 5         0.49   1.4    0.79   2.45   1.14   1.45   0.43   2.07
Gene 6         1.05   1.36   0.97   0.88   1.09   0.76   1.08   0.49
Gene 7         0.77   1.07   0.95   0.76   0.75   0.19   0.64   0.34
Gene 8         0.89   3.92   1.11   0.8    0.63   1.65   1.01   1.23
Gene 9         1.39   0.85   1.34   1.58   2.15   2.25   1.63   1.24
Gene 10        0.63   0.88   0.56   0.94   0.67   0.42   0.6    0.42
Gene 11        0.6    0.62   0.75   0.64   0.49   0.81   0.72   0.82
Gene 12        0.84   0.15   0.67   0.84   0.79   0.93   0.61   0.77
Gene 13        1.24   1.27   1.18   1.87   1.02   1.04   1.3    0.65
Gene 14        1.23   1.04   0.97   0.87   0.81   0.95   1.17   1.13
Gene 15        1.61   1.11   1.33   0.83   0.99   0.63   0.96   0.72
Gene 16        0.59   0.68   1      1.11   1.39   0.86   0.86   0.63
Gene 17        0.47   0.7    0.63   0.76   0.79   1.28   0.56   0.69
Gene 18        1.4    1.4    0.6    0.88   1.33   1.61   2.05   1.05
Gene 19        0.99   0.84   0.86   0.76   0.43   0.79   0.61   0.96
Gene 20        0.73   0.92   0.73   0.73   0.67   0.61   0.81   0.91
Gene 21        1.06   1.07   0.85   1.06   0.79   1.46   0.76   1.1
Gene 22        1.08   0.67   1.16   2.3    0.85   1.55   1.29   1.15
Gene 23        1.29   0.65   1.09   0.86   0.74   1.09   1      1.01
Gene 24        0.9    1      1.04   1.08   0.92   0.99   0.79   0.93
Gene 25        1.25   1.07   1.22   0.94   1.35   1.19   0.98   1.54
Gene 26        0.9    1.34   1.13   0.95   0.53   1.5    0.94   0.8
Gene 27        0.3    0.51   1.45   0.92   1.33   1.61   0.33   0.42
Gene 28        0.39   0.71   0.68   0.57   0.55   0.57   0.6    0.46
Gene 29        1.48   0.67   0.71   1.14   0.95   1.21   0.65   0.74
Gene 30        0.9    0.34   0.9    1.1    0.97   1.01   0.97   1.06
Gene 31        1.16   5.61   0.67   1.03   0.73   1.65   1.14   0.55
Gene 32        0.88   0.86   1.09   6.96   0.58   1.27   0.94   0.76
Gene 33        0.73   0.42   1.53   0.55   0.43   0.69   0.66   1.27
Gene 34        0.84   0.76   0.72   1.61   0.73   1.76   0.82   1.88
Gene 35        2.63   1.55   0.31   0.66   0.49   1.62   0.82   1.94
Gene 36        0.15   0.16   0.1    0.22   1.06   0.12   0.22   0.08
Gene 37        3.01   0.76   1.28   0.76   0.24   2.35   0.52   0.4
Gene 38        1.46   0.98   0.94   0.99   1.03   1.51   1.33   1.88
Gene 39        0.87   0.59   0.84   1.47   0.62   1.37   1.15   1.56
Gene 40        0.77   0.93   0.92   1.23   0.86   0.89   0.59   0.82
Gene 41        1.15   0.43   0.47   1      0.67   0.33   0.48   0.29
Gene 42        1.12   0.91   0.71   0.63   1.06   0.61   0.81   0.78
Gene 43        0.86   0.97   1.24   1.09   0.66   1      1.28   0.47
Gene 44        1.33   1.12   1.10   0.92   1.43   1.12   1.15   0.97
Gene 45        1.41   1.15   1.31   1.32   1.32   1.49   1.43   1.4
Gene 46        1.14   1.18   0.86   0.99   0.88   0.97   0.92   1.32
Gene 47        5.08   4.95   7.08   11.26  7.59   9.59   2.68   2.55
Gene 48        0.66   0.72   1.18   0.92   0.91   1.27   1.16   1.27
Gene 49        1.06   1.15   1.37   1.67   1.05   0.92   1      0.96
Gene 50        32.91  12.32  8.35   4.93   10.99  14.22  4.72   3.15



Disease State  PC     PC     PC     PC     PC     PC     PC     PC
Gene 1         1.24   1.43   0.43   1.26   0.89   1.16   1.31   2.3
Gene 2         0.3    0.82   2.55   0.39   0.87   1.16   0.55   0.63
Gene 3         1.17   0.58   0.5    0.6    0.36   1.85   0.72   1.07
Gene 4         1.56   1.24   1.34   1.84   1.08   1.06   1.47   0.87
Gene 5         0.69   0.92   1.16   1.94   1.34   0.92   1.42   6.99
Gene 6         0.23   0.98   0.57   0.71   0.57   0.73   0.81   0.84
Gene 7         0.4    3.68   0.49   0.23   1.05   0.54   0.79   1.34
Gene 8         1.23   0.61   2.04   1.3    0.79   1.32   3.96   1.64
Gene 9         0.69   1.15   2.6    2.24   1.95   1.47   1.3    1.54
Gene 10        0.48   0.39   0.44   0.8    0.58   0.79   0.42   1.85
Gene 11        0.57   0.58   0.82   0.69   0.67   0.6    0.77   1.09
Gene 12        0.49   0.94   0.85   0.81   1.04   0.83   0.83   0.35
Gene 13        1.02   1.16   0.76   1.49   1.38   1.29   1.47   1.19
Gene 14        1.15   0.85   1.38   1.23   2.06   0.72   1.16   0.98
Gene 15        0.2    0.52   1.1    0.39   0.76   0.37   1.18   2.06
Gene 16        0.68   1.32   0.99   0.78   1.16   0.9    1.03   1.67
Gene 17        0.41   0.73   1.25   0.79   0.9    0.55   0.93   0.68
Gene 18        0.25   0.56   1.71   0.86   3.07   0.99   2.42   2.28
Gene 19        0.48   0.48   0.94   0.1    0.45   0.36   0.37   1.06
Gene 20        0.46   0.5    0.46   0.4    0.47   0.78   0.57   1.31
Gene 21        1.19   1.55   1.16   1.27   1.54   0.93   1.61   0.36
Gene 22        2      0.84   0.86   1.7    1.01   0.6    2.22   0.99
Gene 23        1.03   0.63   1.45   0.72   0.94   1.94   1.06   1.21
Gene 24        0.87   1.11   0.86   1.37   1.18   0.8    1.19   1.74
Gene 25        2.24   1.29   1.27   0.9    1.46   1.02   1.04   1.27
Gene 26        0.28   0.75   0.89   0.85   0.66   1.52   0.43   0.58
Gene 27        6.08   0.41   0.43   5.22   3      1.85   0.17   0.91
Gene 28        0.4    1.07   0.93   1.63   0.92   0.46   0.67   0.95
Gene 29        2.66   0.67   0.84   2.46   0.74   1.5    1.86   2.41
Gene 30        1.17   0.55   0.83   0.98   1.12   1.52   1.29   1.01
Gene 31        0.43   0.3    0.56   1.68   0.81   0.83   1.33   1.39
Gene 32        0.59   1.1    1.86   1.08   1.32   0.59   1.17   0.65
Gene 33        1.16   0.63   0.81   1.04   0.56   0.25   0.61   0.26
Gene 34        1.32   0.63   1.18   0.82   0.73   0.23   0.81   0.45
Gene 35        1.36   0.91   1.09   1.06   0.99   1.16   0.55   2.39
Gene 36        0.2    0.23   0.11   0.13   0.13   0.12   0.24   0.59
Gene 37        0.14   3.68   1.45   5.22   2.06   2.48   3.27   0.59
Gene 38        1.64   0.46   2.15   .2     1.66   0.87   2.78   1.27
Gene 39        1.55   0.71   1.1    1.63   1.19   1.48   3.31   2.14
Gene 40        0.74   0.39   0.47   1.14   0.87   0.9    1.16   2.42
Gene 41        6.08   3.68   1.04   0.36   2.03   1.85   1.24   3.52
Gene 42        0.4    4.67   1.3    5.22   1      1.07   0.47   3.52
Gene 43        0.76   0.6    1.14   0.54   0.88   0.73   0.93   0.69
Gene 44        1.07   0.84   1.03   0.95   1.36   0.89   1.15   1.20
Gene 45        1.16   1.13   1.25   1.4    1.5    1.55   2.21   0.99
Gene 46        1.08   0.87   0.66   0.79   0.61   1.06   1.46   0.98
Gene 47        4.29   2.51   5.7    6.08   7.01   5.58   6.28   5.58
Gene 48        1.18   1.22   1.35   1.31   1.66   1.2    1.13   1.93
Gene 49        1.3    0.76   0.98   0.58   1.08   0.74   0.83   0.65
Gene 50        1.53   1.79   6.49   5.28   4.52   5.41   22.03  4.6





Disease State  BPH    BPH    BPH    BPH    BPH    BPH    BPH    BPH    BPH
Gene 1         3.91   2.56   0.52   1.33   0.93   0.97   1.68   1.29   0.98
Gene 2         4      0.31   7.02   1.61   0.81   0.85   1.06   0.99   0.87
Gene 3         0.91   10.51  0.57   2.56   1.37   1.1    1.2    1.34   0.91
Gene 4         0.85   0.89   1      1.2    1.05   1.09   1.27   1.18   0.68
Gene 5         0.91   4.2    0.45   0.47   1.11   1.48   0.81   2.3    1.13
Gene 6         1.72   1.44   1.13   0.89   1.03   1.25   1.13   1.15   1
Gene 7         0.8    0.74   1.25   1.19   0.94   1.01   1.04   0.92   1.15
Gene 8         1.18   3.69   1.86   0.99   1.12   1.46   1.56   1.53   0.84
Gene 9         1.27   1.28   1.49   1.36   0.87   1.21   0.84   1.02   0.95
Gene 10        0.9    0.99   0.88   0.93   0.64   0.87   0.72   0.76   0.7
Gene 11        0.88   1.12   1.02   0.96   1      0.96   1.1    0.79   0.9
Gene 12        1.03   0.95   1.11   1.29   0.76   1.02   0.93   0.89   1.26
Gene 13        1.02   0.91   1.02   0.87   0.94   1.04   0.93   0.92   1.05
Gene 14        0.71   1.32   1.2    0.92   1.05   1.02   0.98   0.93   0.92
Gene 15        0.75   0.82   0.57   0.76   0.91   0.76   0.86   1.09   1.22
Gene 16        1.02   1.05   1.19   1.01   0.63   0.99   1.03   1.01   0.8
Gene 17        2.14   3.42   1.34   1.61   0.58   0.86   0.67   0.82   0.77
Gene 18        0.54   1.74   2.85   0.7    1.24   1.05   1.35   1.1    0.99
Gene 19        1.41   1.27   0.81   0.81   1.48   1.19   1.23   1.16   0.86
Gene 20        0.72   0.77   0.87   0.66   0.75   0.87   0.89   0.73   0.84
Gene 21        1.11   0.63   0.95   1.16   0.95   1.16   1.62   1.03   0.91
Gene 22        0.89   0.91   1.22   1.19   0.95   1.24   1.27   1.11   0.95
Gene 23        0.86   2.77   0.92   1.2    1.15   1.72   1.71   1.45   1.09
Gene 24        0.8    0.87   0.99   0.78   0.95   0.87   6.9    0.92   0.92
Gene 25        1.51   1.17   1.19   1.38   0.91   1.21   1.43   1.07   0.92
Gene 26        1.42   2.33   0.96   1.43   0.96   1.42   1.59   1.31   0.81
Gene 27        2      0.79   0.7    1.18   0.88   0.78   0.71   0.93   0.99
Disease State  BPH    BPH    BPH    BPH    BPH    BPH    BPH    BPH    BPH
Gene 28        2.1    0.76   1.04   0.67
Gene 29        0.74   1.2    1.01   1.08
Gene 30        1.02   5.06   1.13   1.03
Gene 31        0.64   2.18   1.71   0.87
Gene 32        0.94   0.82   1.29   1.61
Gene 33        0.71   0.65   0.69   0.65
Gene 34        1.16   0.89   0.85   0.81
Gene 35        1.14   1.09   0.72   0.55
Gene 36        0.65   0.73   0.71   0.45
Gene 37        0.79   0.41   0.9    1.66
Gene 38        1.11   0.78   1.55   0.79
Gene 39        0.87   0.91   0.93   1.15
Gene 40        0.96   1.11   0.76   1.83
Gene 41        (illegible)
Gene 42        0.99   0.38   1.72   2.29
Gene 43        0.67   0.81   1.38   0.8
Gene 44        0.75   0.72   0.62   1.03
Gene 45        1.03   0.85   1      0.81
Gene 46        0.79   5.83   0.65   0.74
Gene 47        1.45   1.04   0.74   0.91
Gene 48        1.47   1.66   1.61   1.27
Gene 49        0.79   0.79   1.3    0.82
Gene 50        3.45   0.93   0.85   3.2

[Further BPH columns for Genes 28 to 50; 16 rows legible, gene assignment not preserved]
1.52   1.23   1.32   1.15   0.98
1.35   1.39   1.59   1.48   0.91
0.49   0.81   0.67   0.61   0.64
0.99   1.01   1.03   0.88   0.82
0.96   1.61   1.51   1.34   1.18
1.1    1.49   1.27   1.39   1.36
0.83   0.94   0.93   0.81   0.78
0.88   1.23   1.31   1.05   1.4
0.98   1      1.07   1.18   1.02
0.82   0.97   0.88   0.75   0.88
0.89   1.12   1.64   1.35   0.64
1.27   1.29   1.34   1.4    1.27
0.48   0.67   1.17   0.83   0.09
1.37   1.05   1.1    1.85   1.68
2.96   2.77   2.44   12.77  5.04
2.96   2.77   2.44   2      10.9



Example 2: Two Group Classification Using a Large Data
set and a binomial logistic regression model

In order to identify subsets of genes capable of
classifying tissue into different clinical types of
lymphoma, the data set reported and analysed in Alizadeh,
A. A., et al. (2000) Distinct types of diffuse large
B-cell lymphoma identified by gene expression profiling.
Nature 403:503-511 was subjected to analysis using the
method of the invention in which a binomial logistic
regression was used as the model.

In the data set, there are p=4026 genes and N=42 samples.
In the following DLBCL refers to "Diffuse large B cell
Lymphoma". The samples have been classified into two
disease types GC B-like DLBCL (21 samples) and Activated
B-like DLBCL (21 samples). We use this set to illustrate
the use of the above methodology for rapidly discovering
genes which are diagnostic of different disease types.

The results of applying the methodology are given below.
The model had G=2 classes and commenced with all genes as
potential variables (basis functions) in the model.
After 20 iterations the algorithm found 2 genes, numbers
1281 and 1312 (GENE3332X and GENE3258X), which gave the
misclassification table (Table 5) below, and an overall
classification success rate of 98%. This example ran in
about 20 seconds on a laptop machine.



Table 5

               Predicted class 1   Predicted class 2
True class 1   20                  1
True class 2   0                   21
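
As a quick check of the quoted success rate, the accuracy can be
recomputed from the entries of Table 5. The snippet below is only
an illustrative calculation; the variable names are arbitrary.

```python
# Rows are true classes, columns are predicted classes (values from Table 5).
confusion = [[20, 1],
             [0, 21]]

correct = sum(confusion[i][i] for i in range(len(confusion)))
total = sum(sum(row) for row in confusion)
accuracy = correct / total
print(f"{correct}/{total} = {accuracy:.1%}")  # 41/42 = 97.6%
```

The result, 41 of 42 samples, is the 97.6% figure, which rounds to
the 98% success rate stated in the text.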



To determine whether the result was an artefact due to
the large number of genes (variables) available in the
data set, we ran a permutation test whereby the class
labels were randomly permuted and the algorithm
subsequently applied. This was repeated 1000 times.
Figure 2 gives a histogram of the percent of cases
correctly classified (lambda). The 97.6% accuracy for the
actual data set is in the extreme tail of the permutation
distribution with a p value of 0.013. These
observations suggest the results are not due to chance.

Example 3: Multi group Classification

In order to identify genes capable of classifying samples
into one of a multitude of classes, the data set reported
and analyzed in Yeoh et al. Cancer Cell v1: 133-143
(2002) was subjected to analysis using the method of the
invention in which a likelihood was used based on a
multinomial logistic regression. The same pre-processing
as described in Yeoh et al. has been applied. This
consisted of the following:

• drop the following 8 arrays: BCR.ABL.R4, MLL.R5,
Normal.R4, T.ALL.R7, T.ALL.R8, Hyperdip.50.2M.3,
Hypodip.2M.3, and Hypodip.2M.2

• set the mean response value of each array to 2500

• thresholding - values over 45000 are set to 45000,
values less than 100 are set to 1

• genes with less than 0.01 present are eliminated -
this amounted to 1607 genes

• genes for which the difference between the maximum
and the minimum value was less than 100 are
eliminated (1604 genes)
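
The numerical steps of this pre-processing (rescaling, thresholding
and the range filter) can be sketched as below. This is an
illustrative sketch under stated assumptions, not the code of Yeoh
et al.: the function name `preprocess`, its parameter names and the
toy matrix are hypothetical, the array-dropping step and the
"present" filter are omitted because they depend on array labels and
presence calls not reproduced here, and the replacement value 1 for
small intensities follows the text above as printed.

```python
import numpy as np

def preprocess(expr, target_mean=2500.0, ceiling=45000.0,
               floor=100.0, floor_value=1.0, min_range=100.0):
    """expr: genes x arrays intensity matrix. Returns (filtered matrix, kept rows)."""
    expr = np.asarray(expr, dtype=float)
    # Rescale every array (column) so that its mean equals target_mean.
    scaled = expr * (target_mean / expr.mean(axis=0, keepdims=True))
    # Thresholding: cap large values, replace small values.
    capped = np.minimum(scaled, ceiling)
    thresholded = np.where(capped < floor, floor_value, capped)
    # Drop genes whose (max - min) across arrays is below min_range.
    spread = thresholded.max(axis=1) - thresholded.min(axis=1)
    keep = np.flatnonzero(spread >= min_range)
    return thresholded[keep], keep

# Toy matrix: 4 genes x 2 arrays, each column already has mean 2500.
expr = np.array([[50.0, 80.0],
                 [4000.0, 4900.0],
                 [2950.0, 2920.0],
                 [3000.0, 2100.0]])
filtered, keep = preprocess(expr)
print(keep)
```

In the toy data the first gene is flattened by the low threshold and
the third varies by only 30 units, so both are removed by the range
filter and rows 1 and 3 (0-based) remain.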

After preprocessing there are p=11005 genes and N=248
samples. The samples have been classified into 6 disease
types:

1. BCR-ABL;

2. E2A-PBX1; 

3. Hyperdip> 50; 

4. MLL; 

5. T-ALL and 

6. TEL-AML1. 

This set was used to illustrate the use of the method for
rapidly discovering genes which are diagnostic of
different disease types. The results of applying the
methodology are given below. The model had G=6 classes
and commenced with all genes as potential
variables (basis functions) in the model. After 20
iterations the algorithm found that the following 10
genes separated the classes:

X35823.at, X32562.at, X430.at, X39039.s.at,
X39756.g.at, X1287.at, X40518.at, X38319.at,
X41442.at, X1077.at.

A 15-fold cross validation gave the misclassification
table below (Table 6), with 94% classification success:



Table 6

subtype        1      2      3      4      5      6
BCR.ABL        10     1      3      1      0      0
E2A.PBX1       0      27     0      0      0      0
Hyperdip>50    3      0      60     1      0      0
MLL            1      1      2      16     1      2
T-ALL          0      0      1      0      42     0
TEL-AML1       0      0      0      1      0      79

Confusion matrix for Multigroup classification cross-validation (15-fold)



A permutation test (permuting the class labels) showed
that the cross validated classification success rate of
94% is highly significant (p = 0.00).

Example 4: Standard regression using a generalised linear
model

This example illustrates how the method can be
implemented in a generalised linear model framework. This
example is a standard regression problem with 200
observations and 41 variables (basis functions). The true
curve is observed with error (or noise) and is known to
depend on only some of the variables. The responses are
continuous and normally distributed. We analyse these
data using our algorithm for generalised linear model
variable selection.

This is a generalised linear model with:
Link function: $g(\mu) = \mu$

Derivative of link function: $\frac{dg}{d\mu} = 1$

Variance function: $\tau^2 = 1$
Scale parameter: $\varphi = \sigma^2$

Deviance (likelihood function): $-\frac{N}{2}\log(\sigma^2) - 0.5 \sum_{i=1}^{N} \left(\frac{y_i - \mu_i}{\sigma}\right)^2$

The updating formula for $\sigma^2$ is

where $\mu_i^*$ is the mean evaluated at $\beta^*$ in step 5 of the
algorithm.

The output of the algorithm is given below. 



EM Iteration: 1 expected post: -55.45434 basis fns 41
sigma squared 0.5607509

EM Iteration: 2 expected post: -43.96193 basis fns 41
sigma squared 0.5773566

EM Iteration: 3 expected post: -48.87198 basis fns 39
sigma squared 0.5943395

EM Iteration: 4 expected post: -52.79632 basis fns 31
sigma squared 0.6072137

EM Iteration: 5 expected post: -55.18578 basis fns 28
sigma squared 0.6161707

EM Iteration: 6 expected post: -56.5303 basis fns 23
sigma squared 0.6224545

EM Iteration: 7 expected post: -57.47589 basis fns 17
sigma squared 0.626674

EM Iteration: 8 expected post: -58.0566 basis fns 15
sigma squared 0.6293923

EM Iteration: 9 expected post: -58.41912 basis fns 13
sigma squared 0.6315789

EM Iteration: 10 expected post: -58.6923 basis fns 11
sigma squared 0.633089

EM Iteration: 11 expected post: -58.88766 basis fns 10
sigma squared 0.6343793

EM Iteration: 12 expected post: -59.05261 basis fns 10
sigma squared 0.635997

EM Iteration: 13 expected post: -59.24126 basis fns 9
sigma squared 0.6381456

EM Iteration: 14 expected post: -59.47668 basis fns 9
sigma squared 0.640962

EM Iteration: 15 expected post: -59.7677 basis fns 9
sigma squared 0.6443392

EM Iteration: 16 expected post: -60.10277 basis fns 9
sigma squared 0.6477088

EM Iteration: 17 expected post: -60.44193 basis fns 9
sigma squared 0.6508144

EM Iteration: 18 expected post: -60.7684 basis fns 9
sigma squared 0.6539145

EM Iteration: 19 expected post: -61.09251 basis fns 9
sigma squared 0.6565873

EM Iteration: 20 expected post: -61.38427 basis fns 8
sigma squared 0.6589498

EM Iteration: 21 expected post: -61.65061 basis fns 8
sigma squared 0.6615976

EM Iteration: 22 expected post: -61.92217 basis fns 8
sigma squared 0.664281

EM Iteration: 23 expected post: -62.17683 basis fns 7
sigma squared 0.6663748

EM Iteration: 24 expected post: -62.37402 basis fns 7
sigma squared 0.6679655

EM Iteration: 25 expected post: -62.51645 basis fns 7
sigma squared 0.6689011

EM Iteration: 26 expected post: -62.59567 basis fns 6
sigma squared 0.6689011

EM Iteration: 27 expected post: -62.6151 basis fns 6
sigma squared 0.6690962

EM Iteration: 28 expected post: -62.61717 basis fns 6
sigma squared 0.6691031

EM Iteration: 29 expected post: -62.61739 basis fns 5
sigma squared 0.6691035

The algorithm converges with a model involving 5 of the
41 basis vectors (variables). A plot of the fitted curve
(solid line) for the model with the 5 variables (basis
functions) selected by the algorithm, the true curve
(dotted line), and the noisy data are given in Figure 3,
where the y variable is denoted nf.

Example 5: Small linear regression example using a
generalized linear model

This example is similar to Example 4 but, for brevity, a
smaller number of variables (10) is used. This allows the
full data set to be tabulated (see Table 7). The dependent
variable is a function of the first four variables only;
the remaining variables are noise.

The data were analysed as a generalised linear model
with identity link, constant variance, and a normal
response. After 12 iterations the algorithm converged to
a solution involving just the four variables known to
have predictive information, discarding all six of
the noise variables.
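
As an illustration of the kind of fit described above (normal response, identity link, sparsity-inducing prior on the weights), the following Python sketch runs a generic EM-style reweighted update on simulated data in which only the first four of ten variables carry signal. It is a minimal sketch under stated assumptions: the update shown is the commonly used scheme for a normal prior with a Jeffreys-type hyperprior, which is not necessarily the update used in this example; the function name `sparse_em`, the simulated data, the generating coefficients and the `drop` threshold are all hypothetical, and the data of Table 7 are not used.

```python
import numpy as np


def sparse_em(X, y, n_iter=200, tol=1e-8, drop=1e-6):
    """EM-style sparse linear regression (assumed Jeffreys-type hyperprior).

    Each pass applies beta <- U (s2*I + U X'X U)^-1 U X'y with U = diag(|beta|),
    then re-estimates the noise variance s2 from the residuals.
    """
    n, p = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]  # start from least squares
    sigma2 = np.mean((y - X @ beta) ** 2)
    XtX, Xty = X.T @ X, X.T @ y
    for _ in range(n_iter):
        u = np.abs(beta)
        # A = s2*I + U X'X U, formed without building diag(u) explicitly
        A = sigma2 * np.eye(p) + (u[:, None] * XtX) * u[None, :]
        beta_new = u * np.linalg.solve(A, u * Xty)
        sigma2 = np.mean((y - X @ beta_new) ** 2)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    # variables whose weight has not collapsed to (numerical) zero
    selected = np.flatnonzero(np.abs(beta) > drop)
    return beta, sigma2, selected


# Hypothetical simulated data: 10 variables, only the first 4 are informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
true_beta = np.array([3.0, -2.0, 1.5, 2.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ true_beta + 0.5 * rng.normal(size=100)

beta, sigma2, selected = sparse_em(X, y)
print("selected variables:", selected + 1)
print("sigma squared:", sigma2)
```

In a run of this kind the weights of the informative variables stay close to their generating values, while the weights of uninformative variables tend to be driven to zero; how many noise variables are removed depends on the data and on the threshold chosen.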



Table 7.

Predictor Variables                                                                                  Dependent
V1        V2        V3        V4        V5        V6        V7        V8        V9        V10        Variable
0.778801  0.852144  0.913931  0.960789  0.990050  1.000000  0.990050  0.960789  0.913931  0.852144   0.378571
0.778801  0.697676  0.612626  0.527292  0.444858  0.367879  0.298197  0.236928  0.184520  0.140858   2.832704
0.105399  0.077305  0.055576  0.039164  0.027052  0.018316  0.012155  0.007907  0.005042  0.003151   3.359711
0.001930  0.001159  0.000682  0.000394  0.000223  0.000123  0.000067  0.000036  0.000019  0.000010   2.170812
0.000005  0.000002  0.000001  0.000001  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   3.440226
0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   2.424206
0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  -0.10464
0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   3.672
0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   2.003438
0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   0.970833
0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   1.28257
0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   1.085955
0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  -0.30299
0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   0.050082
0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   0.457228
0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   0.117205
0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  -0.22729
0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   2.094908
0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   1.084125
0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   0.598052
0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  -0.22954
0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000001  0.000001   0.02262
0.000002  0.000005  0.000010  0.000019  0.000036  0.000067  0.000123  0.000223  0.000394  0.000682  -1.5989
0.001159  0.001930  0.003151  0.005042  0.007907  0.012155  0.018316  0.027052  0.039164  0.055576   0.163323
0.077305  0.105399  0.140858  0.184520  0.236928  0.298197  0.367879  0.444858  0.527292  0.612626  -0.46607
0.697676  0.778801  0.852144  0.913931  0.960789  0.990050  1.000000  0.990050  0.960789  0.913931   1 104fi"}fi
0.852144  0.778801  0.697676  0.612626  0.527292  0.444858  0.367879  0.298197  0.236928  0.184520   0.257917
0.140858  0.105399  0.077305  0.055576  0.039164  0.027052  0.018316  0.012155  0.007907  0.005042   0.762435
0.003151  0.001930  0.001159  0.000682  0.000394  0.000223  0.000123  0.000067  0.000036  0.000019  -2.08841
0.000010  0.000005  0.000002  0.000001  0.000001  0.000000  0.000000  0.000000  0.000000  0.000000  -1.45189
0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  -0.08087
0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  -0.10876
0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  -0.55626
0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   0.03139
0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  -0.12116
0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  -0.05413
0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  -0.83486
0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  -1.06148
0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  -0.69641
0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  -0.01406
0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  -1.04083
0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   0.609888
0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  -0.24657
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Predictor Variables 










Depended 


V1 


V2 


V3 


V4 • V5 V6 


W 


V8 


V9 


V10 


Variable 



0.000000 
0.000000 

o.oooodo 

0.000000 

o.oboobo 

0.000001 

0.000682 

0.055576 

0.612626 

0.913931 

0.184520 

0.005042 

0.000019 

0.000000 

0.000000 

0.000000 

0.000000 

0.000000 

0.000000 

0.000000 

0.000000 

0.000000 

o.oooodo 

0.000000 
0.000000 
0.000000 
0.000000 
0.000000 
0.000000 
0.000000 
0.000001 
0.000394 
0.039164 
0.527292 
0.960789 



0.000000 
0.000000 

o.ooobod 

0.000000 

dboobob 

0.000002 

0.001159 

0,077305 

0.697676 

0.852144 

0.140858 

0:003151 

0.000010 

0.000000 

0.000000 

0.000000 

0.000000 

0.000000 

0.000000 

0.000000 

0.000000 

0.000000 

0.000000 

o.booooo 

0.000000 
0.000000 

o.oboooo 

0.000000 
0.000000 
0.000000 
0.000001 
0.000682 
0.055576 
0.612626 
0.913931 



0.000000 
0.000000 
0.000000 
0.000000 

d.odbooo 

0 r .000005 

0.001930 

0.105399 

0.778801 

0.778801 

0.105399 

0.001930 

0.000005 

0.000000 

0.000000 

0.000000 

0.000000 

0.000000 

0.000000 

0.000000 

0.000000 

0.000000 

0.000000 

0.000000 

0.000000 

0.000000 

0.000000 

0.000000 

0.000000 

0.000000 

0.000002 

0.001159 

0.O77305 

0.697676 

0.852144 



0.000000 

o.oooodo 

0.000000 
0.000000 

d.dooboo 

0.000010 

0.003151 

0.140858 

0.852144 

0.697676 

0.077305 

0.001159 

0.000002 

o.oooodo 

0.000000 

0.000000 

0.000000 

0.000000 

0.000000 

0.000000 

0.000000 

0.000000 

0.000000 

0.000000 

0.000000 

0^000000 

0.000000 

0.000000 

0.000000 

0.000000 

0.000005 

0.001930 

0.105399 

0.778801 

0.778801 



0.000000 0.000000 
0.000000 0.000000 



0.000000 
0.000000 

o.odoodd 

0.000019 

0.005042 

0.184520 

0.913931 

0.612625 

0.055576 

0.000682 

0.000001 

0.000000 

0.000000 

0.000000 

o.oodooo 

0.000000 

0.000000 

0.000000 

0.000000 

0.000000 

0.000000 

0.000000 

0.000000 

0.000000 

0.000000 

0.000000 

0.000000 

0.000000 

0.000010 

0.003151 

0.140858 

0.852144 

0.697676 



o.oooooo 

0.000000 
0.000000 o.oooodo 



0.000000 0.000000 

o.oodooo o.booooo 



0.000036 
0.007907 



0.000067 



0.000000 

o.oooooo 

0.000000 
0.000000 

o.oodooo 

0.000123 



0.236928 0298197 
0.960789 0.990050 



0.012155 0.018316 
0.367879 
1.000000 



0.527292 0.444858 

0.039164 0.027052 

0.000394 0.000223 

0.000001 0.000000 

0.000000 O.OOOOOO 

0.000000 0.000000 



0.000000 
0.000000 



0.000000 
0.000000 



0.000000 0.000000 

0.000000 0.000000 

O.OOOOOO 0.000000 
0.000000 



0.367879 
0.018316 
0.000123 

o.oodooo 

0.000000 
0.000000 
0.000000 
0.000000 

o.oooooo 
o.oooodo 

0.000000 



0.000000 
0.000000 
0.000000 

0;0oooob 

0.000000 
0.000223 
0.027052 
0.444858 
0590050 
0298197 



0.000000 
0.000000 
0.000000 
0.000000. 



0.000000 0.000000 
0.000000 0.000000 



0.000000 
0.000000 
0.000000 



0.000000 0.000000 
0.000000 0.000000 



0.000000 
0.000000 0.000000 
0.000000 0.000000 



0.000000 
0.000000 
0.000000 
0.000000 
0.000000 
0.000000 0.000000 



0.000019 
0.005042 
0.184520 . 
0.913931 
0.612626 



0.000036 
0.007907 



0.000000 
O.OOOOOO 
O.OOOOOO 
0.000000 

o.ododdi 

0.000394 
0.039164 
0.527292 
0.960789 
0236928 



0.012155 0.007907 
0.000067 0.000036 



0.000000 
0.000000 
0.000067 
0.012155 



0.236928 0298197 
0.960789 0.990050 
0.527292 0.444858 



0.000000 
0.000000 
0.000000 
0.000000 
0.000000 
0.000000 
0.000000 
0.000000 
0.000000 
0.000000 
0.000000 
. 0.000000 
0.000000 

o.oooooo 

0.000000 
0.000000 
0.000000 
0.000000 
0.000123 
0.018316 
0.367879 
1.000000 
0.367879 



0.000000 
0.000000 
0.000000 
0.000000 
0.000000 
0.000000 

o.oooood 

0.000000 
0.000000 
0.000000 
0.000000 
0.000000 
0.000000 
O.OOOOOQ 
0.000000 

o.oooooo 

0.000000 

o.oooooo 

0.000223 
0.027052 
0.444858 
0.990050 
0.298197 



-0.376 
-1.49206 
-0.17637 
-1.47 
-0.45 
-1.713 
-0.74203 
-0.66797 
-0.36114 
-0.97318 
-2.549 
-2.71749 
-1.373 
-1.624911 
-0.981 01| 
-1.1S 
-3.11507 
-O.31209 
0.237347 
-0.7206 
-0.53267 
-1.144511 
0.323257 
-2.1 31 £ 
1.188074 
-1.18391) 
-127328 
-1.40458 
-0.60408 
-1.35922 
-0.45079 
-1.29463 
0.162774 
-0.53674 
-0.45683 
Predictor Variables                                                                                  Dependent
V1        V2        V3        V4        V5        V6        V7        V8        V9        V10        Variable
0.236928  0.184520  0.140858  0.105399  0.077305  0.055576  0.039164  0.027052  0.018316  0.012155   0.391253
0.007907  0.005042  0.003151  0.001930  0.001159  0.000682  0.000394  0.000223  0.000123  0.000067   0.117457
0.000036  0.000019  0.000010  0.000005  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  .0.6986?
1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  -1.85312
1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  -0.0486i
1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000   0.214684
1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000   0.261316
1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  -0.57448
1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000   2.468938
1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  -0.93785
1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000   1.165921
1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000   0.966748
1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000   0.125721
1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000   0.867138
1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000   0.551458
1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000   0.287231
1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  -0.75881
1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000   0.551283
1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000   0.066577
1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000   0.503767
1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000   0.067802
1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000  -1.44586
1.000000  1.000000  1.000000  1.000000  1.000000  0.000000  0.000000  0.000001  0.000001  0.000002   0.884697
0.000005  0.000010  0.000019  0.000036  0.000067  0.000123  0.000223  0.000394  0.000682  0.001159  -0.49601
0.001930  0.003151  0.005042  0.007907  0.012155  0.018316  0.027052  0.039164  0.055576  0.077305  -0.24083
0.105399  0.140858  0.184520  0.236928  0.298197  0.367879  0.444858  0.527292  0.612626  0.697676   0.027056
0.778801  0.852144  0.913931  0.960789  0.990050  1.000000  0.990050  0.960789  0.913931  0.852144   [illegible]
0.778801  0.697676  0.612626  0.527292  0.444858  0.367879  0.298197  0.236928  0.184520  0.140858   0.517325
0.105399  0.077305  0.055576  0.039164  0.027052  0.018316  0.012155  0.007907  0.005042  0.003151   1.688736
0.001930  0.001159  0.000682  0.000394  0.000223  0.000123  0.000067  0.000036  0.000019  0.000010   2.813648
0.000005  0.000002  0.000001  0.000001  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   0.877579
0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   2.008548
0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   1.728837
0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   1.712295
0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   2.676133
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V1 


V2 


V3 


V4 


Predictor Variables 

V5 V6 


V7 


• 

V8 


V9 


V It 


P\o non Hon I 

Uy pen u em 


. 0.000000 


0.000000 


0.000000 


O.OOOOOO 


0:000000 


O.OOOOOO 


0 000000 
u. u Uwuuy 


0 fViOftftO 

U.UUUUUU 


ft ftftnftrift 
u.uuuuuu 


n nnrihnh 

U.UUUUUL 




o.oobooo 


0.000000* 


0.000000 


0.000000 


0.000000 


0 000000 

u » v vw u v 


0 000000 

U . U \J\J\J\J\J 


0 ftftftftftft 
u.uuuuuu 


n ftnnnnn 
u.uuuuuu 


u.uuuuuu 


Z.7U417S 


0.000000 


o.oobooo 


0.000000 


0.000000 


0.000000 


0.000000 


Vl.VJ U UUuU 


ft ftftftnnft 
u.uuuuuu 


n nnnnnn 
U.UUUUUU 


u.uuuuuu 


Z.I 71262 


0.000000 


0.000000 


0.000000 


0.000000 


0.000000 


0 000000 


o nnonoo 


ft ftftftftftft 
u.uuuuuu 


n nnnnnn 

U.UUUUUU 


U.UUUUUU 


o nooo4*3 
3.02981 «: 


0.000000 


o.oooooo 


o.oooooo 


0.000000 


0.000000 


0 000000 

U. UUUU UU 


ft ftftnnnn 

U.UUUUUU 


n nnnnnn 
u.uuuuuu 


n nnrxnnn 
U.UUUUUU 


O.OOOOOO 


3.048587 


0.000000 


0.000000 


o.oooooo 


0.000000 


0.000000 


0 000000 


0 ftftftftftft 


n ftftftnnft 
u.uuuuuu 


ft nnnnnn 

U.UUUUUU 


n nnr\nnr\ 
U.UUUOOO 


2.721537 


0.000000 


o.oooooo 


0.000000 


o.oooooo 


0.000000 


o oooboo 

w«w wwww w 


0 oooooo 

u.uuuuuu 


n rtftrinnn 

. U.UUUUUU 


n nnnnnn 
U.UUUUUU 


n nnr*nr\h 
U.UUUOOO 


2.74891 


0.000000 


0.000000 


0.000000 


0.000000 


0.000000 


0.000000 


n 000000 

U.UUUUUU 


ft ftftnnnn 

U .UUUUUU 


n nnnnnn 

U.UUUUUU 


n nnnnnn 
U.UUUUUU 


1 .08928E 


0.000000 


0.000000 


0.000000 


o.oooooo 


0.000000 


0 000000 

v.WVW UUU 


o ooonhft 

u.uuuuuu 


ft ftftftftftft 
u.uuuuuu 


n ftnnnn-t 

U.UUUUU 1 


n nnnnnn 
U.UUUUU1 


3.1 87072 


0.000002 


0.000005 


0.000010 


0.000019 


0.000036 


0 000067 

U.U WUUU I 


o ftftm 

U.UUU 1 zo 


ft nftft??** 


n nnn^QA 
u.uuuo94 


Q. 000682 


3.197012 


0.001159 


0.001930 


0.003151 


0.005042 


0.007907 


U »U l£ 1 v>o 


0 01R*ilfi 

U.U 1 Uu 1 U 


ft ft97n^9 


u.uoyit>+ 


n n*x<jiT7c 
U.UuODYo 


1 .648355 


6.077305 


0.105399 


0.140858 


0.184520 


0 236928 




ft 1R7R7Q 


n AAAQIZQ 
U.*t*HKJOO 


n CO"7TJQ*5 


U.DI 2526 


1.283097 


0.697676 


0.778801 


0.852144 


0,91 3 93 f 


O 960789 

V/»wwwf w>w 


0 990050 


1 onnftftft 

I .uuuuuu 


u.yyuuou 


u.abu/t>y 


0.913931 


1 .001 207 


0.852144 


0.778801 


0.697676 


0.612626 


0 527292 


0 444858 


ft **R7R7Q 
U.OO/ Of a 


Ui^aony/ 


n O'jeooD 

u.^oazo 


u.1 84520 


2.190649 


0.140858 


0.105399 


0.077305 


0.055576 


0.039164 


0 027052 


ft ftin^iR 

U.U 1 OO 1 D 


U.U 1^1 do 


u.uwyur 


0.005042 


1 .037059 


0.003151 


0.001930 


0.001159 


0.000682 


0.000394 


0 000223 


U.UUU I i-O 


ft ftftftftfi7 
UiUUUUD/ 


n nnnn^cz 

U.UUUUOD 


n nnnn«i o 

u.uuuui y 


0.617336 


0.000010 


0.000005 


0.000002 


0.000001 


0.000001 


0 000000 

w » V ww w w w 


ft ftftftftftft 
u.uuuuuu 


u.uuuuuu 


n nnnnnn 
U.UUUUUU 


U.UQUOOD 


1 .56651 


O.000000 


0.000000 


0.000000 


0.000000 


0.000000 


0 000000 


ft ftftftftftft 

U.UUUUUU 


n nnnnnn 
u.uuuuuu 


n nnnnnn 
U.UUUUUU 


O.OOOOOO 


-0.72404 


0.000000 


0.000000 


0.000000 


0.000000 


0.000000 


0 000000 


ft ftftftftftft 
u.uuuuuu 


n nnnnnn 

U.UUUUUU 


n nnnnnn 
U.UUUUUU 


0.000000 


0.015634 


0.000000 


0.000000 


0.000000 


0.000000 


0.000000 


O.OOOOOO 


0 OOftftftft 


ft ftftftftftft 


n nnnnnn 

U.UUUUUU 


n nnnnnn 

u.uuuuuu 


■* *>QCC 

-1.z8oc 


0.000000 


0.000000 


0.000000 


0.000000 


0.000000 


0.000000 


o ooonnn 

u.uuuuuu 


ft ftftftftftft 
u.uuuuuu 


ft nnnnnn 
u.uuuuuu 


n nnnnnn 
u.uuuuuu 


0.9474 


0.000000 


0.000000 


0.000000 


0.000000 


0.000000 


o.oooooo 


0 000000 

U.UUUUUU 


o onnnftft 

u.uuuuuu 


ft ftnnnftn 

U.UUUUUU 


n nnnnnn 
u.uuuuuu 


n on/ii77 


0.000000 


0.000000 


0.000000 


0.000000 


0.000000 


0.000000 


0.000000 


ft oooooo 


ft ftftftftftft 

U.UUUUUU 


n ftftftftftn 

U.UUUUUv 


1 nT91R 
-T.U/Z IC 


0.000000 


0:000000 


0.000000 


0.000000 


n nnnnnn 
u.uuuuuu 


n nr\nnr\r\ 


0 000000 

w*v WW WW w 


0 OOftftftft 


ft ftftftftftft 


ft nnnnnn 

U.UUUUUU 


u.zo/y4o 


o.oooooo 


0.000000 


0.000000 


0.000000 


0.000000 


0.000000 


0 000000 


0 ftftftftftft 
u.uuuuuu 


ft ftftftftftft 

U.UUUUUU 


n nnnnnn 

U.UUUUUl 


n oecoc 


0.000000 


0.000000 


0.000000 


0.000000 


0.000000 


0.000000 


ft oooooo 


0 ftftftftftft 
u.uuuuuu 


ft ftftftftftft 

u.uuuuuu 


ft nnftnnn 
u.uuuuuu 


_n oQC/ic 

-u.yyo**D 


0.000000 


0 000000 


n ftftftnnn 


u.uuuuuu 


0.000000 


0.000000 


0.000000 


0.000000 


0.000000 


o.oooboc 


0.360602 


0.000000 


0.000000 


0.000000 


0.000000 


0.000000 


0.000000 


0.000000 


0.000000 


0.000000 


o.oooooc 


-0.72222 


0.000000 


0.000000 


O.OOOOOO 


0^000000 


0.000000 


0.000000 


o.oooopo 


0.000000 


0.000000 


o.odoooc 


0.066804 


0.000000 


0.000000 


0.000000 


0.000000 


0.000000 


0.000000 


0.000000 


0.000000 


0.000000 


o.oooooc 


0.379405 


0.000000 


0.000000 


' 0.000000 


0.000000 


0.000000 


0.000000 


o.oooooo 


0.000000 


o.oooooo 


o.oooooc 


0.307738 


0.000000 


0.000000 


0.000000 


0.000000 


0.000001 


0.000001 


0.000002 


0.000005 


0.000010 


0.00001E 


-0.09646 


0.000036 


0.000067 


0.000123 


0.000223 


0.000394 


0.000682 


0.001159 


0.001930 


0.003151 


.0.005045 


-0.59666 


0.007907 


0.012155 


0.018316 


0.027052 


0.039164 


0.055576 


0.077305 


0.105399 


0.140858 


0.18452C 


-0.0747S 


0.236928 


0.298197 


0.367879 


0.444858 


0.527292 


0.612626 


0.697676 


0.778801 


0.852144 


0.913931 


0.366227 
Predictor Variables                                                                                  Dependent
V1        V2        V3        V4        V5        V6        V7        V8        V9        V10        Variable
0.960789  0.990050  1.000000  0.990050  0.960789  0.913931  0.852144  0.778801  0.697676  0.612626   0.146715
0.527292  0.444858  0.367879  0.298197  0.236928  0.184520  0.140858  0.105399  0.077305  0.055576  -1.03773
0.039164  0.027052  0.018316  0.012155  0.007907  0.005042  0.003151  0.001930  0.001159  0.000682   0.43298
0.000394  0.000223  0.000123  0.000067  0.000036  0.000019  0.000010  0.000005  0.000002  0.000001  -0.77253
0.000001  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  -1.59873
0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  -0.91667
0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  -0.18372
0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   0.05454
0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  -1.29388
0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  -1.58155
0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  -2.46637
0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  -2.70749
0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  -2.9947
0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  -2.85241
0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  -2.46247
0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  -3.73826
0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  -3.68025
0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  -3.57917
0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  -3.69291
0.000000  0.000000  0.000000  0.000000  0.000000  0.000001  0.000001  0.000002  0.000005  0.000010  -3.69046
0.000019  0.000036  0.000067  0.000123  0.000223  0.000394  0.000682  0.001159  0.001930  0.003151  -2.85299
0.005042  0.007907  0.012155  0.018316  0.027052  0.039164  0.055576  0.077305  0.105399  0.140858  -4.54066
0.184520  0.236928  0.298197  0.367879  0.444858  0.527292  0.612626  0.697676  0.778801  0.852144  -3.46635
0.913931  0.960789  0.990050  1.000000  0.990050  0.960789  0.913931  0.852144  0.778801  0.697676  -2.3129
0.612626  0.527292  0.444858  0.367879  0.298197  0.236928  0.184520  0.140858  0.105399  0.077305  -1.909
0.055576  0.039164  0.027052  0.018316  0.012155  0.007907  0.005042  0.003151  0.001930  0.001159  -1.38891
0.000682  0.000394  0.000223  0.000123  0.000067  0.000036  0.000019  0.000010  0.000005  0.000002  -1.70557
0.000001  0.000001  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  -2.08043
0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  -1.34632
0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  -1.84107
0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  -1.83476
0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  -0.86864
0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  -0.65575
0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   0.340095
0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  -1.0628
0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   1.159148
0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   1.126134
0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  -0.15135
0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  -0.57837
0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   1.241583
0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   1.165105
0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  -0.2
0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  -0.23279
0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   1.092664
0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  -0.52286
0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   1.348673
0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  -0.22802
0.000000  0.000001  0.000001  0.000002  0.000005  0.000010  0.000019  0.000036  0.000067  0.000123   0.260234
0.000223  0.000394  0.000682  0.001159  0.001930  0.003151  0.005042  0.007907  0.012155  0.018316   1.027092
0.027052  0.039164  0.055576  0.077305  0.105399  0.140858  0.184520  0.236928  0.298197  0.367879   0.585037
0.444858  0.527292  0.612626  0.697676  0.778801  0.852144  0.913931  0.960789  0.990050  1.000000  -0.5898
0.990050  0.960789  0.913931  0.852144  0.778801  0.697676  0.612626  0.527292  0.444858  0.367879   0.439033
0.298197  0.236928  0.184520  0.140858  0.105399  0.077305  0.055576  0.039164  0.027052  0.018316  -0.42928



Table 8: Gene Expression Data and Survival for 50 Genes 
from Alizadeh et al 



Patient 


Survival 

Time Outcome X1 554 X1 639 X1 777 X1 876 X1 9 08 X1 940 X2045 X2208 X2339 X2383 X2395 X2430 X2491 


V32 


1.3 


1 0.270-0.730-0.100-0.080 0.570-0.510 0.520 1.830 0.500 0.110-0.630-0.250 1.940 


V17 


2.4 


1-0.170-0.480-0.560-0.470-0.350 0.860 0.830 2.320-0.080 0.770-0.740-0.230 0.220 


V18 


2.9 


1 0.040-0.010-1.110-0.880-0.540-0.340 0.380 2.730 0.580 0.300-0.580 0.120 1.390 


V6 


3.2 


1-0.300 0.020-0.440-0.300-0.220-0.140 1.430 0.640 0.100 0.370-0.480-0.530 0.110 


V2 


3.4 


1 -0.050-0.096-0.700-0.390-0.140-0.140 0.540-0.230 0.090 1.130 0.090 0.000 0.480 


V12 


4.1 


1-0.050-0.200 0.570-0.190 0.360 0.400 0.040 0.000-0.120-0.190 0.270 0.090-0.530 


V20 


4.6 


1-0.36O 1.100-0.620-0.520-0.160 -0.290 0.570 0.900 0.460 0.1Q0 -0.520 0.000-0.180 


V25 


5.1 


1 -0.010-0.300-0.410-1.070 0.160-0.410 0.620 1.860 0.020 0.500 0.050 0.050 1.720 
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Survival 

Patient Time Outcome X1 554X1 639X1 777X1 876X1908X1 940X2045X2208 X2339 X2383X2395 X2430 X2491 



721 


8.2 


1 -0.810-0^95 -1 .190 r0.434 -0.330 -0.640 -0.470 -0.480 -0.250 -0.209 -0.148 -0.290 0.220 


77 


8.3 


1-0.230-0.750 0.320-0.260 0.220-0.270 0.750 0.540 0.440-0.620 0.260-0.050 0.640 


739 


9.5 


1-0.250 0.370-0.340 0.980-0.380-0.370-3.210 2.470 0.930-0.220-0.770 0.650 ^).920 


724 


11.8 


1 0.000-0.140-0.460-0.340-0.170 0.000 0.310 0.630-0.020 0.000 0.270 0.660 1.000 


729 


12.3 


1 0.390-0.360 0.120-0.240 0.180 0.890-1.180 0.600-0.080-0.870 0.480-0.310 0.930 


733 


12.7 


1 0.260 0.060-0.040 0.150 0.190 0.800-0.310 0.290 0.170-0.170-6.370-0.130 0.260 


716 . 


15.5 


1 -0.290-0.210-0.660-0.980-0.030 0.340 0.350-0.620 0.630 0.380-0.160 0.280 0.080 


V40


22.3 


1 -0.150 -0.380 -0.080 -0.500 -0.090 -0.220 -0.760 1.670 -0.496 -0.009 0.550 -0.400 -0.290


V13


23.7 


1-0.920 -0.460 -1.1 50 -0.380 -0.440 0^60-6.660 0.700 0.790 0.220 0.000-0.270 0.070 


V11


27.1 


1 0.060-1.620-0.590-0.340-0.080-1.300 0.890 1.450-0.080-0.025-1.080-0.560 1.620 


V37


31.5 


1-0.090 0.050-0.290-0.230 0.070-0.820 0.740 1.190-0.340 0.370 0.760 0.610 0.640 


V23


32.5 


1-0.380 0.060-0.150-0.570 0.120-0.470-0.530 0.250-0.480-0.390 0.390 0.320 0.700 


V38


39.6 


1 -0.145 0.060 0.340-0.270 1.240 0.280-1.320-2.580 0.180-0.040 0.510-0.040-1.570 


V5


51.2 


0-0.070-0.380-0.080 0.000 0.130 0.050 0.190 0.130 0.220-0.760 0.190-0.110 0.190 


V36


53.7 


0-0.200 -0.410 -0.320 -0.430 -0.600 0.240 2.170 0.350 0.310-0.050-0.320 0.370-0.090 


V15


56.6 


0-0.820 0.160-0.040-1.250 0.500-0.550-0.380-0.010-0.760 0.460 0.110 0.200 0.790 


V14


59.0 


0-0.340-0.043-0.700-0.056 0.198 1.000 0.630 0.290 0.000 0.298-0.090 0.120-0.060 


V31


68.8 


0 0.080-0.140 0.200 0.110-0.080 0.160 0.230 1.330 0.290-0.250-0.050-0.050 0.430 


V30


69.1 


0 0.380 0.720 0.700 0.390 0.000 -0.580 -0.670 -0.480 -0.61 oi.640 0.440 1.630 0.590 


V4


69.6 


0-0.060-0.570-0.380-0.830 0.060-0.010 0.470 1.230 0.010-0.060 0.210-0.090 0.150 


V3


71.3 


1 -0.400 -0.280 -0.390 -0.490 -0.040 -0.080 1.640 2.110 0.220-0.100 0.210 0.010-0.700 


V28


71.3 


ft ft 7flft ft IRfi ft inn n icn n nn n Of?n n aczn a to a r\ oar* *t oaa a coa a tAn a~ota 
u u./uu u. lull U.1UU U.loU U.OIU U.^OU U.oDU [J.f -U.zyO -1 .^UU U.OoU -U.14-U U.o/v 


V34


72.0 


0-0.940-0.050-0.060-0.240-0.070 0.440-1.500 0.170 0.090 0.147-0.050 0.340 3.630 


V1


77.4 


0 0.000 0.530 0.980 0.380 0.910 1.080 3.210 2.580 -0.220-0.140 -0.870 0.100-0.970 


V19


80.4 


0 -0.190 -0.340 -0.950 -0.430 -0.410 0.150 -0.110 -0.800 -0.050 -0.870 -0.780 -0.100 0.170


V27


83.8 


0 0.280 0.000-1.110-1.040-0.570-0.150 0.720 1.080 0.070 0.220-0.180 0.620 0.000 


V10


88.1 


0 0.690-0.370 0.000 0.130 0.060-0.160 0.670-0.910 0.120 -0.130 -0.590 0.220-0.600 


V9


89.8 


0-0.220-0.700-0.790-0.340-0.260 0.190-1.130-1.960-0.150 0.490 0.540 0.320 0.710 


V26


90.2 


0-0.200 0.420-0.270 0.240 0.470-0.670-0.280 0.380-0.890 0.850 0.000-0.210-0.440 


V35


91.3 


0-0.560-0.660-0.810-0.530-0.250-0.250 0.720-1.160 0.230 0.090-0.350 0.940 0.720 


V8


102.4 


0 0.260 -0.800 0.420 0.260 -0.030 0.450 -0.870 0.550 0.380 0.280 0.120 0.170 0.260


V22


129.9 


0-0.680-0.120-0.210-0.460 0.690-0.390 0.070 0.490-0.670 0.910-0.220-0.150 1.200 






Patient 


Survival 

Time Outcome X2544 X2640 X2824 X2882 X2922 X3041 X3138 X3171 X3249 X3346 X3494 X4021


V32 


1.3 


1-0.110 0.280-0.120 0.560 1.440-0.300 0.440 2^70 -0.230 -0.150 0.300 0.096 


V17 


2.4 


1 0.050 -0.150 0.050 -0.260 0.610 0.050 0.070 1.490 -0.370 0.060 0.180 0.010


V18 


2.9 


1 0.380 2.030 1.740 1.050 0.080 1.480 0.840 0.000 0.570 1.110 0.035-0.620 


V6 


3.2 


1 -0.480 -0.180 -0.220 -0.270 -0.770 0.180 -0.410 1.520 -0.820 0.670 0.040 -0.190


V2


3.4 


1 -0.120-0.260-0.510 0.240-0.250 0.220 0.290 0.670 0.340-0.280-0.080 0.240 


V12


4.1


1 -0.210 -0.160 -0.960 -0.450 0.620 -0.330 0.210 0.860 -0.340 -0.310 -0.310 1.640


V20 


4.6 


1 0.010 -0.120 -0.480 -0.130 0.400 -0.060 -0.070 0.470 -0.250 -0.090 0.040 -0.160


V25 


5.1 


1 -0.520 -0.530 0.070 -1.010 -2.590 -0.480 -0.030 2.010 -0.990 0.810 0.890 -0.180


V21 


8.2 


1 0.150-0.100 0.250 0.400-0.340 0.820 0.520 0.990-0.120 0.600-0.440 0.610 


V7 


8.3 


1-0.250-0.090 0.060 0.240-0.260-0.230 0.130 1.040 0.440-0.440-0.200 0.030 


V39 


9.5 


1-0.060 0.630 0.510 0.890 0.410 0.280 0.000 1.040 0.380-1.210 0.630 0.290 


V24 


11.8 


1 -0.660 0.350 0.140 -0.140 -0.630 0.070 -0.350 1.280 0.260 0.420 0.310 -0.030


V29 


12.3 


1 0.010 0.030 0.070 0.000 0.530 0.090 0.240 1.360-0.240-0.050 0.000 0.100 


V33 


12.7 


1 0.120 -0.250 0.040 -0.220 0.720 0.110 -0.100 1.540 -0.270 0.500 -0.260 0.550


V16 


15.5 


1 -0.290 -0.510 0.570 0.140 -1.020 -0.080 -0.030 0.090 0.360 0.460 0.190 -0.300


V40 


22.3 


1-0.130 0.010 0.150 0.590 0.970 0.440-0.160 1.480 0.530-0.280-0.450 1.080 


V13 


23.7


1-0.470 0.120 0.590 0.260 0.570 0.550 0.133-0.950-0.280 0.010 0.140-0.007 




V11


27.1


1 0.560 0.700 0.400 0.780 0.040 0.010 0.330 -0.130 0.230 -0.070 0.980 -0.560


V37 


31.5 


1 -0.130 1.090 0.880 0.910 0.060 0.350 0.170 0.750 -0.200 -0.460 -0.470 -0.340


V23


32.5 


1-0.290 0.000 0.360-0.560-1.180 0.020 0.170 0.710-0.740 0.330 0.180 -0.300 


V38 


39.6 


1 0.040 0.220 0.880-0.250-1.150 0.380 0.250-0.160-0.410-0.110 0.380-0.620 


V5 


51.2 


0-0.051 0.460 0.210 0.390 0.260 0.280 0.270 1.830 0.030-0.040-0.700-0.080 


V36 


53.7 


0 0.110 0.140-0.310-0.160 0.040-0.160-0.150-0.520-0.180 0.130 0.100-1.410 


V15 


56.6 


0 -0.460 0.160 0.030 -0.340 1.000 0.170 0.360 -0.820 -0.310 0.420 0.280 0.240


V14 


59.0 


0 0.110 -0.250 -0.150 0.040 0.030 0.180 0.040 -0.470 -0.280 -0.050 0.268 -0.310


V31 


68.8 


0-0.140 0.010 0.000 0.350 0.890-0.010 0.000 0.880-0.170-0.090-0.400 0.060 


V30 


69.1 


0-0.670 0.240 0.450 1.240 1.180 -0.110 -0.080 -0.080 -0.210 0.010 -0.790 0.280 


V4 


69.6 


0 0.130 0.380 0.480 0.450 0.920 0.250 0.100-0.790 0.420 0.290 -0.300 -0.090 


V3 


71.3 


1-0.070 0.470-0.090 0.140 -0.320 0.250 0.790 0.040 0.100 0.010 -0.890 -0.610 


V28 


71.3 


0 -0.580 -0.240 -0.010 -0.050 0.200 0.080 -0.210 0.770 0.000 -0.010 0.060 0.090


V34 


72.0 


0-0.280 0.120 0.690 0.120 -0.080 0.260 0.210 2.460 0.220-0.440 0.170-0.640 


V1


77.4 


0-0.110 0.130-1.610-1.150 0.550-0.420 0.230-1.190-0.010 0.180-0.390-0.810 


V19 


80.4 


0 0.000 0.360 0.890 0.710 0.530 0.370 0.230 0.440 0.510 0.710-0.590-0.540 


V27


83.8 


0-0.020 0.590 0.580 0.630 0.310 0.160-0.070 0.700-0.240 0.380 0.140 -0.500 
Patient


Survival

Time Outcome X2544 X2640 X2824 X2882 X2922 X3041 X3138 X3171 X3249 X3346 X3494 X4021


V10


88.1 


0 0.080 -0.590 -0.570 -0.410 -0.100 -0.290 -0.490 -0.270 -0.730 0.320 -0.180 -0.330


V9 


89.8 


0 0.110 0.250 0.600 0.760 -0.020 0.770 0.300 -0.610 0.750 -0.350 -0.300 0.640


V26 


90.2 


0-0.140-0.250 0.120-0.380-0.160 OiOl 04)j&30 0.170-0.540 0.520 0.840 0.730 


V35


91.3 


0 -0.550 1.050 1.290 0.700 0.180 0.290 0.790 0.450 -0.660 0.090 -0.380 -0.180


V8


102.4 


0 0.200 0.340 1.620 1.350 1.550 0.210 0.440 1.050 0.030 -0.530 -0.090 0.490


V22


129.9 


0-0.210-0.330 0.300-0.210-0.190-0.200-0.160 1.630-0.540 0.000 0.500-0.030 



Gene

Survival

Patient

Time Outcome X9 X206 X234 X281 X286 X388 X396 X456 X482 X690 X827 X1075 X1098


V32 


1.3 


1 -1.110-0.150 


0.920 0.000 0.520 -0.140 -0.440 


0.250 0.510-0.260 -0.120 1.340 -0.710 


V17 


2.4 


1 -0.520 -0.100 


1.580 0.580 0.270-0.040-0.040 


1.040 0.170 -0.860 0.260 -1.320 0.290


V18 


2.9 


1 0.350 1.000 


0.010 3.840 0.450 0.880 0.640 


0.510-0.740-0.890-1.080-0.160 0.130 


V6


3.2 


1 0.390 0.130 


-0.140 -0.830-0.330 0.430-0.580 


0.030 0.120 -0.700 -0.880 0.960 0.200


V2


3.4 


1 0.110 0.100 


2.100 0.370 -0.090 0.690 0.520


0.530 0.170 0.660-0.360 0.910-0.013 


V12 


4.1 


1 -1.020 -0.070


0.810-0.010 0.310 0.440 0.850 


0.330 -0.020 0.380 0.160 -0.790 -0.350 


V20 


4.6 


1 -0.070 0.030 


1.460 0.670 0.060 0.560 0.800


0.270 0.150 -0.910 -0.550 -0.410 -0.140


V25


5.1 


1 0.410 0.230 


0.340 0.050 -0.910 -0.300 -0.820 


0.350 -0.460 -0.330 -1.810 0.880 -0.800


V21 


8.2 


1 -0.690 -0.060 


1.140 1.800 0.270 -1.190-1.420 


0.090 -0.350 -0.405 -0.400 -0.290 1.910 


V7


8.3


1 -0.380 -0.110 


0.190 0.110 0.000 0.210-0.110 


0.300-0.070 0.630 0.030-1.280-0.160 


V39 


9.5 


1 -0.490 0.340-0.420 0.650 0.970 1.170 1.030 


0.530 -1.200 1.330 0.000-0.950-0.060 


V24


11.8 


1 0.460 0.330 


0.080 1.770 0.950 -0.100 -0.040 -0.090 -0.140 -0.470 -0.210 -0.900 -0.040 


V29 


12.3 


1 0.250 0.050 


0.520 -1.160-0.420 0.180 0.510 0.090 -0.100 0.580 0.360-0.140-1.180 


V33 


12.7 


1 -1.150 0.000 


1.340 0.590 0.400-2.140-1.880 


0.410 0.020-0.960-0.140-1.710-0.460 


V16 


15.5 


1 0.280-0.140 


0.500 -2.220 -1.220 -0.200 -0.910 


-0.340-0.610-1.010 0.160 0.190-0.680 


V40 


22.3 


1 0.100-0.090 


0.120 -0.030 -1.080 -0.600 -0.750 


0.450-0.160-0.570 0.130 0.210-0.243 


V13


23.7 


1 0.070 0.180 


0.850 1.270 0.200 0.570 -1.170 -0.330 0.090 -1.390 -0.340 -1.010 -0.030 


V11 


27.1 


1 -0.340 0.260 


-0.110-0.850-0.740 0.440-1.750-0.640 0.900-0.680-0.790 0.280 -0.560 


V37 


31.5 


1 0.380 0.310 


0.630 2.760 1.750 0.220-0.310 


0.030 0.100-0.120 0.140 -1.220-0.030 


V23


32.5 


1 0.230 -0.180 -0.040 -0.640 -1.100 -0.990 -0.980 -0.270 -0.690 -0.790 -0.580 -0.850 -0.220


V38 


39.6 


1 0.220 0.050 0.000-2.080-0.930 0.630-1.590-0.290 0.000-1.280 0.540 0.570-0.070 


V5


51.2 


0 0.150 0.050 


-2.480 1.260 0.680 0.730 1.020 


0.220 0.480 0.020 0.340-0.780 -0.210 


V36 


53.7 


0 0.370 0.430 0.600 0.700 0.580 1.510 0.540 0.130 0.320 0.380-0.260-0.430-0.100 


V15


56.6 


0 0.880 0.440 


-0.050 1.350 0.560-2.360-1.070 


0.300-0.300 0.320 0.450-1.580-0.040 


V14 


59.0 


0 -0.450 -0.190


0.020 0.010-0.460-1.300 0.020 0.100-0.350 0.820-0.280-0.730 0^071 
Gene

Survival 



Patient 


Time Outcome X9 X206 X234 X281 X286 X388 X396 X456 X482 X690 X827 X1075 X1098 


V31


68.8 


0 -1.090 -0.210 0.450 -0.090 0.170 0.400 0.640 0.470 -0.330 -0.810 0.080 -0.510 -0.420


V30


69.1 


0 -0.610 -0.130 -0.180 0.390 0.110 0.030 0.720 0.070 -0.290 0.000 0.730 -1.140 0.180


V4 


69.6 


0 0.100 0.360 -0.690 0.590 -0.120 0.280 -0.280 -0.090 0.350 -0.100 -0.130 0.180 -0.110


V3


71.3 


1 -0.250 0.390-0.150-0.250-0.470-1.630 0.350 0.360 0.560 0.730-0.290-1.060 0.080 


V28


71.3 


0-0.160 0.000 0.290 0.160 0.260 0.130 0.400 0.040-0.500-0.550 0.190-1.530-0.490 


V34


72.0 


0 -0.400 -0.100 -0.210 0.490 0.460 0.500 -0.260 -0.360 -0.270 -1.580 -0.890 -0.870 0.850


V1


77.4 


0 0.390 -0.990 -1.750 -2.460 -0.127 -1.240 -1.540 -1.190 0.380 -1.060 0.140 -0.980 0.660


V19 


80.4 


0 0.000 0.520 0.550-0.230-0.490 0.520-0.440-0.100 0.460 0.680-0.410 0.730-0.310 


V27 


83.8 


0-0.660 0.540 0.490 1.890 0.800 0.110 0.320-0.210-0.440-1.340-1.390-0.090-0.490 


V10 


88.1 


0 -0.260 -0.230 -0.670 -0.490 0.030 0.200 0.000 0.370 0.330 0.660-0.090 0.520 0.140 


V9 


89.8 


0 0.170 -0.280 0.540 -0.270 -0.440 0.100 -0.320 -0.040 0.760 -1.430 -0.540 0.980 -0.446


V26


90.2 


0 -0.030 -0.350 -0.070 -0.870 -0.610 -0.660 -0.170 -0.380 -0.320 -0.640 -0.380 -1.310 -0.146 


V35


91.3 


0 0.750 0.220-1.840 0.040 0.540 0.810 0.440 0.430 0.370-0.760-0.530 0.760-0.270 


V8


102.4 


0-0.300 0.020-0.590 0.370 0.160-1.390 1.140 0.090 0.110 0.040-0.030 0.040-0.050 


V22


129.9 


0-0.110 -0.190 0.040-0.370-0.810-0.250 0.000 -0.190 -1.200 -0.500 -1.000 0 370-0 1.™ 



Patient 


Survival 

Time Outcome X1100 X1108 X1130 X1135 X1182 X1202 X1245 X1341 X1350 X1421 X1441 X1535


V32


1.3 


1 -0.150 0.000-1.070-0.050 1.420 0.010-1.400 0.110 0.310 -0.470 0.080 0.630 


V17


2.4 


1 -0.460-0.010 0.120 0.690 1.740 0.260-2.750 1.260 0.440-0.380 0.150 0.130 


V18 


2.9 


1 -1.120 0.030-0.410-0.210 0.344-0.390-2.140-1.360 0.050 0.560 0.100-0.340 


V6


3.2 


1 0.280-0.150 0.400 0.010-0.120 0.090-1.360-1.040 0.104-0.550 0.420-0.050 


V2 


3.4 


1 0.004 -0.210-0.240 0.140 0.430 0.580 0.060 1.380 0.230 0.470-0.370 0.340 


V12


4.1 


1 -0.070 -0.310 -0.250 0.520 0.080 0.950 -0.690 0.380 -0.330 -0.620 0.110 0.310


V20 


4.6 


1 -0.780 0.180 1.120 1.240 0.900 0.170 -0.070 1.680 0.080 -0.470 -0.170 0.100


V25 


5.1 


1 0.920 0.270 0.300 0.730 0.910 -0.250 -0.360 -0.030 0.910 -0.030 -0.500 -0.720


V21 


8.2 


1 -0.145 -0.580 -1.190 -0.630 -0.590 -0.006 -0.224 0.205 0.190 0.130 -0.110 -1.220 


V7


8.3 


1 -0.310 0.090 0.350 0.460 0.480 0.160 0.320 0.630-0.170-0.010 0.060 0.250 


V39


9.5 


1 -0.290-0.040 1.170 0.800 -0.570 -0.050 -0.166 -0.840 -0.720 -0.120 -0.150 0.160 


V24


11.8 


1 0.110 0.040 0.890 1.510 -0.090 -0.620 0.030 0.190 0.140 0.130 -0.190 0.080


V29


12.3 


1 -0.550 -0.070 -0.280 0.240 0.090 0.540 0.950 1.420 0.160 -0.340 0.000 -0.060


V33


12.7 


1 0.080 -0.110 -1.560 -0.410 0.590 -0.540 -0.150 2.090 0.480 0.340 -0.730 0.070


V16


15.5 


1 -0.120 0.500 -0.600 -0.430 1.560 -0.140 -0.970 0.970 0.570 0.240 -0.110 -0.480


V40


22.3 


1 -0.310 -0.520 0.140 0.110 0.220 0.790 -0.050 -0.510 -0.220 -0.960 0.070 -0.330
Survival




Patient 


Time Outcome X1100 X1108 X1130 X1135 X1182 X1202 X1245 X1341 X1350 X1421 X1441 X1535


V13 


23.7 


1 0.130 0.020 0.460 0.230 0.380 -0.570 -0.370 -1.550 0.480 0.070 0.090 -0.160


V11


27.1 


1 -0.950 -0.370 -0.960 -0.110 0.390 0.120 -1.170 1.230 0.530 -0.100 -0.250 -0.420 


V37 


31.5 


1 -0.160 -1.070 -1.300 -0.690 -0.500 -0.190 0.370 -0.360 0.119 0.120 -0.150 0.630


V23 


32.5 


1 0.180-0.030 0.530 0.700-0.010-0.540 1.480-0.870 0.280 0.270-0.220-0.310 


V38


39.6 


1 0.160 0.400 0.760 0.450 0.440 -0.510 0.150 -2.350 0.140 0.320 -0.570 0.010


V5 


51.2 


0 -0.070 -0.380 -0.730 -0.070 -0.090 0.060 0.890 -0.410 0.375 0.050 -0.120 0.370


V36 


53.7 


0-0.380-0.260 0.410 0.620-0.560-0.010-0.240 0.330 0.130-0.360 0.400 0.080 


V15 


56.6 


0 0.200-0.180-1.060-0.590-0.230-0.900-0.620 0.223-0.060-0.540 0.100-0.210 


V14 


59.0


0 0.250 0.550 -0.087 0.373 0.510 0.190 0.110 0.334 -0.010 0.990 0.500 -0.150


V31 


68.8 


0 0.200 -0.110 -0.680 0.000 0.210 0.280 -0.140 1.330 0.130 -0.460 -0.190 0.080


V30 


69.1


0 0.380 -0.410 0.010 0.500 0.000 0.010 0.960 -0.320 -0.340 -0.530 0.180 0.410


V4 


69.6


0 -0.340 -0.380 -1.040 -0.220 -0.350 0.480 0.860 0.680 0.320 0.330 -0.300 0.170


V3 


71.3 


1 -0.360 0.170 -0.450 0.070 -0.110 0.110 -1.220 0.150 0.190 -0.270 0.440 0.100


V28 


71.3 


0 0.210 -0.100 0.130 0.440 0.160 0.390 1.170 0.790 0.200 0.110 -0.360 0.130


V34


72.0 


0 0.510 0.610-0.150 0.320 0.500 0.430 0.760-0.340 0.270 0.150-0.460 0.220 


V1 


77.4 


0 0.120 -0.430 -0.990 -0.170 0.600 0.320 -1.590 0.300 -0.370 -0.250 -0.180 -0.610


V19


80.4 


0-0.170 0.300-0.390 0.170 0.280 -0.950 -0.370 0.550 0.780 0.410-0.350-0.570 


V27 


83.8 


0 0.300-0.010-0.240 0.000 -0.930 -0.530 -0.410 0.170 0.470 0.420 0.020-1.430 


V10 


88.1 


0 -0.080 -0.050 0.170 0.040 0.560 0.670 -0.070 -0.060 -0.300 -0.250 0.110 0.010


V9


89.8 


0 0.000-0.240-1.080-0.410 0.270 0.220 0.680 1.740 0.130 0.190-0.580-0.050 


V26 


90.2 


0 -0.041 0.360 0.190 0.310 0.660 0.070 2.990 -0.370 0.370 -1.000 0.080 0.000


V35


91.3


0-0.240-0.370-0.800-0.350 0.900 0.190 0.920 0.650 0.137 0.740 0.070-0.110 


V8 


102.4 


0-0.550-0.540 0.210 0.450 -0.480 0.630 1.450 0.080-1.080-0.040-0.030 0.000 


V22 


129.9 


0 0.350-0.030-0.400-0.100 0.800 0.220 0.890-0.400 0.380-0.140-0.030-0.310 



Example 6: Lymphoma Survival Analysis
This example uses real survival data from
http://llmpp.nih.gov/lymphoma/data.shtml

The companion article is Alizadeh AA, et al. (2000)
Distinct types of diffuse large B-cell lymphoma
identified by gene expression profiling. Nature
403(6769):503-11

The data are microarray data for 4026 genes and 40
samples (individuals), with a survival time and a
censoring indicator available for each sample. The data
were analysed using the algorithm, implementing Cox's
proportional hazards model.
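
To make the role of the survival times and censoring indicators concrete, the following is a minimal Python sketch of the Cox partial log-likelihood (Breslow form), the likelihood term of a proportional hazards model. It is an illustration only and not the algorithm of this example: the function name cox_partial_loglik is hypothetical, the prior on the component weights is omitted, and the four samples are a small excerpt of the tabulated data (patients V32, V17, V18 and V5, gene X1554).

```python
import math


def cox_partial_loglik(times, events, X, beta):
    """Cox partial log-likelihood (Breslow form), illustrative only.

    times  : survival or censoring time for each sample
    events : 1 if the event was observed, 0 if the time is censored
    X      : list of covariate rows (e.g. gene expression values)
    beta   : one weight per covariate
    """
    # Linear predictor eta_i = x_i . beta for each sample
    eta = [sum(x * b for x, b in zip(row, beta)) for row in X]
    loglik = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if d_i != 1:
            continue  # censored samples enter only through the risk sets
        # Risk set: all samples still under observation at time t_i
        risk = sum(math.exp(eta[j]) for j, t_j in enumerate(times) if t_j >= t_i)
        loglik += eta[i] - math.log(risk)
    return loglik


# Excerpt of the tabulated data: V32, V17, V18 (events) and V5 (censored)
times = [1.3, 2.4, 2.9, 51.2]
events = [1, 1, 1, 0]
X = [[0.270], [-0.170], [0.040], [-0.070]]  # expression of gene X1554

print(cox_partial_loglik(times, events, X, [0.0]))
```

With all weights at zero every sample in a risk set is equally likely to fail, so the value reduces to minus the sum of the log risk-set sizes; non-zero weights raise or lower it according to how well the covariates order the observed failures.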

Note that the algorithm has selected 3 genes as being
associated with survival time (genes: 3797X, 3302X, 356X).

Example 7: Reduced Lymphoma Survival Analysis

For completeness of documentation, we also present an
example based on a subset of the genes from Alizadeh et
al. 50 genes were selected, including 47 chosen at random
and the 3 genes identified as significant in the analysis of
the full data set. The data are shown in the following
table 9, which gives gene expression (for the reduced set
of 50 genes) and survival for each patient.

The data were analysed using the version of the algorithm
containing Cox's proportional hazards survival model.
After 22 iterations, five genes were selected, including
2 genes from the solution for the full set. The full
results (including an iteration history) are given below:

**************************************************
EM Iteration: 0 expected post: 2

**************************************************
Number of basis functions 50

**************************************************

EM Iteration: 1 expected post: -56.0195287084271
**************************************************

Number of basis functions 50
**************************************************

EM Iteration: 2 expected post: -54.947811363042
**************************************************

Number of basis functions 37 

************************************************** 

EM Iteration: 3 expected post: -54.3317631914479 
************************************************** 

Number of basis functions 21 

**************************************************

EM Iteration: 4 expected post: -54.0607159790051
**************************************************

Number of basis functions 13 

************************************************** 

EM Iteration: 5 expected post: -53.7980836894172 
************************************************** 

Number of basis functions 10 
ID(s) of the variable (s) left in model 
3 4 14 16 17 20 25 33 43 50 
regression coefficients 

1.30171200916394 1.48405810198456e-005 -0.491799506481601
0.688155245054059 5.82517870544154e-007 -1.13172255995036
2.95075622492565e-008 0.000301721699857512
-0.748378079168908 1.2775730496471

************************************************** 

************************************************** 

EM Iteration: 6 expected post: -53.5560385409619 
************************************************** 

Number of basis functions 8 

ID(s) of the variable (s) left in model
3 4 14 16 20 33 43 50
regression coefficients

1.30877141820174 1.11497455349489e-009 -0.440934673358609
0.731610034191797 -1.15246816508172 8.10391142899109e-007
-0.736752926831824 1.29017005214433

************************************************** 

************************************************** 

EM Iteration: 7 expected post: -53.4357726710363 
**************************************************

Number of basis functions 6
ID(s) of the variable (s) left in model

3 14 16 20 43 50
regression coefficients

1.30981441669383 -0.377350760745259 0.751065294832691
-1.16718699172136 -0.722720884604726 1.29171119706608

************************************************** 

******** ****************************************** 

EM Iteration: 8 expected post: -53.4338660629788 
************************************************** 

Number of basis functions 6 

ID(s) of the variable (s) left in model 

3 14 16 20 43 50 

regression coefficients 

1.30685231664004 -0.29722933884524 0.758547724825121
-1.17959350866281 -0.703886124955911 1.28487528071873

**************************************************

************************************************** 
EM Iteration: 9 expected post: -53.5154485460488
**************************************************

Number of basis functions 6

ID(s) of the variable (s) left in model

3 14 16 20 43 50

regression coefficients

1.30125961104666 -0.199901821555315 0.760639983868042
-1.19192749808285 -0.679917691918485 1.27242335041331

************************************************** 

************************** ************************ 

EM Iteration: 10 expected post: -53.6545745873571
************************************************** 

Number of basis functions 6 

ID(s) of the variable (s) left in model 

3 14 16 20 43 50 

regression coefficients 

1.29433188361771 -0.0976106309061782 0.760491979596701
-1.20394672329711 -0.653272803573524 1.25725914248418

**************************************************

**************************************************

EM Iteration: 11 expected post: -53.820846021012 
************************************************** 

Number of basis functions 6 

ID(s) of the variable (s) left in model 

3 14 16 20 43 50 

regression coefficients 

1.28789874198243 -0.0244121499875095 0.759681966852181
-1.21216963682011 -0.630795741658714 1.24350708784212

************************************************** 
**************************************************
EM Iteration: 12 expected post: -53.9601661781558

**************************************************

Number of basis functions 6

ID(s) of the variable (s) left in model

3 14 16 20 43 50

regression coefficients

1.28354595931721 -0.00154101225658052 0.758893058476497
-1.21415984287542 -0.618231410989467 1.2344850269793

**************************************************

**************************************************

EM Iteration: 13 expected post: -54.0328345444009 
************************************************** 

Number of basis functions 6 

ID(s) of the variable (s) left in model 

3 14 16 20 43 50 

regression coefficients 

1.2812179536199 -6.11852419349075e-006 0.75822352070402
-1.2134621579905 -0.612781276468739 1.22967591873953 

************************************************** 

**************************************************

EM Iteration: 14 expected post: -54.06432139112
**************************************************

Number of basis functions 5 

ID(s) of the variable (s) left in model 

3 16 20 43 50 

regression coefficients 

1.28009759620513 0.757715617722854 -1.21278912622521
-0.610380879961096 1.22727470412141
**************************************************

**************************************************

EM Iteration: 15 expected post: -54.0802180622945 
************************************************** 

Number of basis functions 5 

ID(s) of the variable (s) left in model 

3 16 20 43 50 

regression coefficients 

1.27956525855826 0.757384281713778 -1.21240801636852
-0.609289206977176 1.22609802569321

************************************************** 

************************************************** 

EM Iteration: 16 expected post: -54.0881669099217 
************************************************** 

Number of basis functions 5 

ID(s) of the variable (s) left in model 

3 16 20 43 50 

regression coefficients 

1.27931094048991 0.75718874424491 -1.21221126091477
-0.608784296852685 1.22552534756029

************************************************** 

************************************************** 

EM Iteration: 17 expected post: -54.0920771115648
**************************************************

Number of basis functions 5 

ID(s) of the variable (s) left in model 

3 16 20 43 50 

regression coefficients 

1.27918872576943 0.757080124746806 -1.21211237581804
-0.608548335650073 1.22524731506564

**************************************************

************************************************** 

EM Iteration: 18 expected post: -54.0939910705254
************************************************** 

Number of basis functions 5 

ID(s) of the variable (s) left in model 

3 16 20 43 50 

regression coefficients 

1.27912977236735 0.757022055016955 -1.2120632265046
-0.608437260261357 1.22511245075764

**************************************************

**************************************************

EM Iteration: 19 expected post: -54.0949258560397
************************************************** 

Number of basis functions 5 

ID(s) of the variable (s) left in model 

3 16 20 43 50 

regression coefficients 

1.27910127561155 0.7569917935789 -1.2120389492013
-0.608384684306891 1.22504705594823 

************************************************** 

************************************************** 

EM Iteration: 20 expected post: -54.0953817354683 
************************************************** 

Number of basis functions 5 

ID(s) of the variable (s) left in model 

3 16 20 43 50
regression coefficients

1.27908748612817 0.756976302198781 -1.21202700813484
-0.608359689289364 1.22501535228942

**************************************************

**************************************************

EM Iteration: 21 expected post: -54.0956037952427 
************************************************** 

Number of basis functions 5

ID(s) of the variable (s) left in model 

3 16 20 43 50 

regression coefficients 

1.27908080980647 0.756968473084121 -1.21202115330035
-0.608347764395965 1.2249999841173

**************************************************

**************************************************

EM Iteration: 22 expected post: -54.0957118531261 
************************************************** 

Number of basis functions 5 

ID(s) of the variable (s) left in model 

3 16 20 43 50 

regression coefficients 

1.27907757649105 0.756964553746695 -1.21201828961347
-0.608342058719735 1.2249925351853



Example 8: Survival Analysis with a parametric hazard
The data is 1694w.dat from

http://www.wpi.edu/~mhchen/survbook/. This is data on
survival of melanoma. There are n=255 individuals, 100 of
whom have censored survival times. Each individual has
four covariates, namely treatment, thickness, age and
sex. To illustrate the methodology we added 4000 dummy
genes to this data set to give a data matrix with 4004
columns and 255 rows. By design the 4000 "genes" are not
associated with survival time. Algorithmically, the
challenge is to identify the important variables from
4004 potential predictors, most of which carry no
information. The data were analysed using a parametric
Weibull model for the hazard function.

The algorithm selected only one variable: age. All of the
pseudo gene variables were discarded rapidly. The
Weibull shape parameter was estimated as 0.68.
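
A Weibull shape parameter below 1 corresponds to a hazard that decreases with time. The short Python sketch below evaluates the standard Weibull hazard and survival functions for the estimated shape of 0.68. The scale value of 1.0 is an assumption made purely for illustration (no scale estimate is reported in this example), the function names are hypothetical, and the dependence of the hazard on covariates such as age is left out.

```python
import math


def weibull_hazard(t, shape, scale=1.0):
    """Weibull hazard h(t) = (shape / scale) * (t / scale) ** (shape - 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1.0)


def weibull_survival(t, shape, scale=1.0):
    """Weibull survival function S(t) = exp(-(t / scale) ** shape)."""
    return math.exp(-((t / scale) ** shape))


SHAPE = 0.68  # estimate reported in this example
for t in (0.5, 1.0, 2.0, 4.0):
    # scale = 1.0 is an illustrative assumption, not a fitted value
    print(t, weibull_hazard(t, SHAPE), weibull_survival(t, SHAPE))
```

Because the exponent shape - 1 is negative here, the printed hazard falls as t grows, while the survival probability still decreases monotonically.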

Example 9: Ordered Categorical Analysis for Prostate
Cancer

The example is from Dhanasekaran et al 2001. See also
http://www.nature.com/cgi-taf/DynaPage.taf?file=/nature/journal/v412/n6849/full/412822a0_fs.html

and the Supplementary files at 
http://www.naturexom^ 

There are 15 samples (individuals) with 9605 genes.
Missing values were replaced by row means + column means
minus the grand mean. There were four ordered categories
(G=4), namely

1. NAP normal

2. BPH benign

3. PCA localised

4. MET metastasised
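
The missing-value rule stated above (row mean + column mean minus the grand mean) can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name impute_additive is hypothetical, missing entries are represented by None, all means are taken over the observed entries only, and every row and column is assumed to contain at least one observed value.

```python
def impute_additive(matrix):
    """Fill missing entries (None) with row mean + column mean - grand mean."""

    def mean_observed(values):
        observed = [v for v in values if v is not None]
        return sum(observed) / len(observed)

    n_cols = len(matrix[0])
    row_means = [mean_observed(row) for row in matrix]
    col_means = [mean_observed([row[j] for row in matrix]) for j in range(n_cols)]
    grand_mean = mean_observed([v for row in matrix for v in row])

    # Observed entries are kept; only the gaps are replaced
    return [
        [row_means[i] + col_means[j] - grand_mean if v is None else v
         for j, v in enumerate(row)]
        for i, row in enumerate(matrix)
    ]


example = [[1.0, 2.0],
           [3.0, None]]
print(impute_additive(example))
```

In the toy matrix the row mean is 3.0, the column mean is 2.0 and the grand mean is 2.0, so the gap is filled with 3.0.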
The algorithm found 1 gene (gene number 6611, their
accession ID R31679) which could correctly classify all
the individuals apart from 1 misclassification.

The iterations from the EM algorithm are as follows: 

***********************************************

Iteration 1 : 10 cycles, criterion -6.346001
misclassification matrix

fhat 
f 1 2 

1 23 0 

2 0 22 

row =true class 

Class 1 Number of basis functions in model : 9608

***********************************************

Iteration 2 : 5 cycles, criterion -13.21228
misclassification matrix

fhat 
f 1 2 

1 22 1 

2 1 21 

row =true class 

Class 1 Number of basis functions in model : 6127 
*********************************************** 

Iteration 3 : 4 cycles, criterion -14.11706 
misclassification matrix

fhat 
f 1 2 

1 22 1 

2 2 20 

row =true class 

Class 1 Number of basis functions in model : 359
***********************************************
Iteration 4 : 4 cycles, criterion -12.14269
misclassification matrix

fhat 
f 1 2 

1 23 0 

2 2 20

row =true class 

Class 1 Number of basis functions in model : 44 



*********************************************** 
Iteration 5 : 5 cycles, criterion -9.134629 
misclassification matrix

fhat 
f 1 2 

1 23 0 

2 1 21 

row =true class 

Class 1 Number of basis functions in model : 18
***********************************************

Iteration 6 : 5 cycles, criterion -6.549706
misclassification matrix

fhat 
f 1 2 

1 23 0 

2 1 21 

row =true class 

Class 1 Number of basis functions in model -. 
***********************************************

Iteration 7 : 5 cycles, criterion -4.988667

misclassification matrix
fhat
f 1 2

1 23 0 

2 1 21 

row =true class 

Class 1 : Variables left in model 
1 2 3 408 6614 7191 8077
regression coefficients
16.0404 8.799716 4.196934 -0.004482982 -9.059594
0.01061934 -1.245061e-09

Iteration 8 : 5 cycles, criterion -4.278911
misclassification matrix

fhat 
f 1 2

1 23 0

2 1 21 

row =true class 

Class 1 : Variables left in model 
1 2 3 408 6614 7191
regression coefficients

20.00335 10.90405 5.268265 -1.996441e-05 -11.30149
0.001403909

Iteration 9 : 4 cycles, criterion -3.980305
misclassification matrix

fhat 
f 1 2

1 23 0

2 1 21

row =true class

Class 1 : Variables left in model
1 2 3 408 6614 7191
regression coefficients
22.18902 12.03594 5.834313 -3.711782e-10 -12.53288
2.460434e-05

*********************************************** 

Iteration 10 : 4 cycles, criterion -3.860487 
misclassif ication matrix 

fhat 
f 1 2 

1 23 0 

2 1 21 

row =true class 

Class 1 : Variables left in model 
1 2 3 6614 7191
regression coefficients 

23.18785 12.54724 6.089298 -13.09617 7.553351e-09 
************************************* 

Iteration 11 : 4 cycles, criterion -3.813712 
misclassification matrix

fhat 
f 1 2 

1 23 0 

2 1 21 

row =true class 

Class 1 : Variables left in model 
1 2 3 6614

regression coefficients 

23.60507 12.76061 6.1956 -13.33150 

*********************************************** 

Iteration 12 : 3 cycles, criterion -3.795452
misclassification matrix
fhat
f 1 2

1 23 0 

2 1 21 

row =true class 

Class 1 : Variables left in model 
1 2 3 6614

regression coefficients 

23.7726 12.84627 6.238258 -13.42600
***********************************************

Iteration 13 : 3 cycles, criterion -3.788319
misclassification matrix

fhat 
f 1 2

1 23 0

2 1 21

row =true class 

Class 1 : Variables left in model 
1 2 3 6614

regression coefficients 

23.83879 12.88010 6.255108 -13.46334 

*********************************************** 

Iteration 14 : 3 cycles, criterion -3.785531 
misclassification matrix

fhat
f 1 2

1 23 0 

2 1 21 

row =true class 

Class 1 : Variables left in model
1 2 3 6614 

regression coefficients 

23.86477 12.89339 6.261721 -13.47800 
Iteration 15 : 3 cycles, criterion -3.784442
misclassification matrix

fhat
f 1 2

1 23 0 

2 1 21 

row =true class 

Class 1 : Variables left in model 
1 2 3 6614

regression coefficients 

23.87494 12.89859 6.26431 -13.48373 

***********************************************

Iteration 16 : 2 cycles, criterion -3.784016
misclassification matrix

fhat 
f 1 2 

1 23 0 

2 1 21 

row =true class 

Class 1 : Variables left in model 
12 3 6614 

regression coefficients 

23.87892 12.90062 6.265323 -13.48598 

********* ************************************** 

Iteration 17 : 2 cycles, criterion -3.783849
misclassification matrix
    fhat
f    1  2
  1 23  0
  2  1 21
row =true class

Class 1 : Variables left in model
1 2 3 6614
regression coefficients
23.88047 12.90142 6.265719 -13.48686
***********************************************

Iteration 18 : 2 cycles, criterion -3.783784
misclassification matrix
    fhat
f    1  2
  1 23  0
  2  1 21
row =true class

Class 1 : Variables left in model
1 2 3 6614
regression coefficients
23.88108 12.90173 6.265874 -13.48720

Final misclassification table

pred 
y 1 2 3 4 

1 4 0 0 0 

2 0 2 1 0 

3 0 0 4 0 

4 0 0 0 4 

Identifiers of variables left in ordered categories model 
6611 

Estimated theta 

23.881082 12.901727 6.265874 
Estimated beta 

-13.48720 

A plot of the fitted probabilities is given in Figure 6
below. The lines denote classes as follows: dashed line
= class 1, solid line = class 2, dotted line = class 3,
dotted and dashed line = class 4. Observations (index) 1
to 3 were in class 2, 4 to 7 were in class 1, 8 to 11 were
in class 3 and 12 to 15 were in class 4.

Example 10: Ordered Categorical Analysis for Prostate
Cancer - Selected Genes

This example is identical to that of Example 9, with the
exception that the data set has been reduced to 50
selected genes. One of these genes is the gene found
significant in Example 9; the others were selected at
random. The purpose of this example is to provide an
illustration based on a completely tabulated data set
(Table 10).



Missing values were replaced by row means + column means
minus the grand mean. There were four ordered categories
(G=4), namely:

1. NAP normal
2. BPH benign
3. PCA localized
4. MET metastasised
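
The missing-value rule stated above (row mean + column mean minus the grand mean) can be sketched as follows. This is a minimal illustration only: the function name `impute_additive` is hypothetical, missing entries are assumed to be marked with `None`, and all means are assumed to be taken over the observed entries (the text does not say how the means were computed).

```python
def impute_additive(table):
    """Fill None entries with row mean + column mean - grand mean.

    Assumption: every row and every column holds at least one observed value,
    and all means are computed over observed entries only.
    """
    n_rows, n_cols = len(table), len(table[0])

    def mean(values):
        present = [v for v in values if v is not None]
        return sum(present) / len(present)

    grand_mean = mean([v for row in table for v in row])
    row_means = [mean(row) for row in table]
    col_means = [mean([table[i][j] for i in range(n_rows)]) for j in range(n_cols)]

    return [
        [v if v is not None else row_means[i] + col_means[j] - grand_mean
         for j, v in enumerate(row)]
        for i, row in enumerate(table)
    ]


# Toy 2 x 2 table (made-up numbers, not taken from Table 10).
example = [[1.0, 2.0], [3.0, None]]
filled = impute_additive(example)
```

In the toy table the missing cell has row mean 3.0, column mean 2.0 and grand mean 2.0, so it is filled with 3.0; observed entries are left unchanged.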

The algorithm found one predictive gene (gene 1 of Table
10), which was equivalent to gene 6611 (Accession
R31679) of Example 9. The prediction success was, of
course, identical to that of Example 9 (since it was
based upon the same single gene).
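
As a rough check on the reported fit, the sketch below turns the estimates printed in Example 9 (theta and beta) into class probabilities for the 15 samples, using the gene 1 expression values as read from row 1 of Table 10. It rests on an assumed continuation-ratio reading of the model, in which logistic(theta_k + x*beta) is the probability that a sample is in class k+1 given that its class is at most k+1; the class labels per sample are taken from the description of Figure 6. The function names are hypothetical and this is not the implementation used to produce the output above.

```python
import math

THETA = [23.881082, 12.901727, 6.265874]  # "Estimated theta" in Example 9
BETA = -13.48720                          # "Estimated beta" in Example 9

# Gene 1 expression for samples 1..15 (row 1 of Table 10).
GENE_1 = [1.6520, 1.1480, 0.8600, 2.2490, 3.0190, 4.0320, 1.8900, 0.9430,
          0.8890, 0.7960, 0.6340, 0.1040, 0.2040, 0.2740, 0.0830]

# Samples 1-3 class 2, 4-7 class 1, 8-11 class 3, 12-15 class 4.
TRUE_CLASS = [2, 2, 2, 1, 1, 1, 1, 3, 3, 3, 3, 4, 4, 4, 4]


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def class_probabilities(x):
    """Return [P(class 1), ..., P(class G)] for one expression value x."""
    n_classes = len(THETA) + 1
    probs = [0.0] * n_classes
    remaining = 1.0  # probability mass not yet assigned to a higher class
    for k in range(len(THETA), 0, -1):  # k = G-1, ..., 1
        # Assumed link: P(class = k+1 | class <= k+1)
        top = logistic(THETA[k - 1] + x * BETA)
        probs[k] = remaining * top  # list index k holds class k+1
        remaining *= 1.0 - top
    probs[0] = remaining
    return probs


def confusion_matrix():
    """Rows are true classes, columns are the most probable fitted class."""
    table = [[0] * 4 for _ in range(4)]
    for x, truth in zip(GENE_1, TRUE_CLASS):
        p = class_probabilities(x)
        predicted = p.index(max(p)) + 1
        table[truth - 1][predicted - 1] += 1
    return table


matrix = confusion_matrix()
```

Under these assumptions the resulting matrix agrees with the final misclassification table of Example 9: the only error is sample 3 (expression 0.8600, class 2), whose most probable fitted class is 3.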
Table 10: Disease Stage and Gene Expression for Selected Genes

Disease Stage  2 2 2 1 1 1 1 3 3 3 3 4 4 4 4

Gene  Expression (samples 1 to 15)
 1  1.6520 1.1480 0.8600 2.2490 3.0190 4.0320 1.8900 0.9430 0.8890 0.7960 0.6340 0.1040 0.2040 0.2740 0.0830
 2  1.0464 1.7040 1.0655 1.0860 1.0133 1.0509 1.0006 1.0568 1.0286 1.1060 0.9700 1.1016 0.6020 0.8080 1.0843
 3  [illegible]
 4  0.4990 0.7100 0.7230 0.6700 0.7190 0.5520 [illegible]
 5  1.4324 1.1230 1.4516 1.1350 1.5340 1.3290 1.3867 2.3590 1.6770 1.2430 1.2450 1.4620 1.3950 1.3510 1.33[illegible]
 6  0.9800 0.9580 1.0100 1.1800 1.0360 1.0610 1.3030 0.6610 0.9913 1.02[illegible]
 7  1.7784 1.9060 1.7976 0.8400 1.0520 1.4500 1.0560 4.7570 2[illegible]
 8  0.8440 1.0800 1.1070 0.6570 1.0240 0.7510 1.1790 1.1830 1.0329 1.3040 1.1200 0.8010 1.3110 0.9640 1.3790
 9  1.3625 1.6750 1.4220 0.9400 0.9850 1.8830 1.3168 1.3730 1.3448 1.3744 1.3290 1.4177 1.5270 1.3724 1.1030
10  [illegible] 0.7850 0.5550 0.8690 0.6110 0.5410 0.7530 0.8450 [illegible]
11  1.1284 1.1185 1.1475 1.0953 1.0953 1.1329 1.0826 1.1388 1.1106 1.1402 1.0949 1.1836 1.1541 1.1383 1.1280
12  0.8580 1.1825 1.0500 1.4560 1.0630 0.8470 1.0810 2.8890 1.1550 1.2042 1.0490 1.0310 0.9940 1.2023 0.8310
13  0.9030 0.9600 0.7650 1.2030 0.9290 1.2830 0.9800 0.9923 1.0480 0.8100 1.0060 0.9370 0.9120 0.9870 1.0170
14  1.7000 2.0640 1.9900 2.1290 1.8380 1.9030 1.6590 1.8620 1.5200 2.0130 1.2440 1.2500 0.9360 2.2600 0[illegible]
15  0.8690 0.7930 0.8210 1.0060 0.8310 0.8410 0.8250 0.8290 0.8643 0.9080 0.8250 0.9373 0.7280 0.8920 1.3040
16  0.9720 1.0620 1.1040 0.8750 1.0280 0.9890 0.9260 0.8670 1.1260 1.2760 0.9860 0.8640 1.3490 1.5980 1.5790
17  0.9820 1.8410 1.0790 2.4510 0.9130 1.5380 0.9790 0.8130 0.8750 1.1919 1.1465 0.9150 1.2058 1.1899 0.5900
18  0.5040 0.7860 0.6460 0.7280 0.8910 0.6320 0.8390 0.4910 1.0340 0.6880 0.6200 0.3890 0.4400 0.6110 0.4640
19  1.1427 1.5020 1.3440 1.0730 1.1840 1.1472 1.0970 1.1532 1.1250 1.1546 1.1092 1.1979 1.1685 1.1160 0.9350
20  1.2235 1.2136 0.5470 1.1904 0.5230 0.5400 1.0620 1.2339 1.2057 0.6590 0.6020 5.0830 0.8880 1.2820 1.0450
21  0.4920 0.7360 0.6500 0.6520 0.5910 0.5610 0.7050 0.6170 0.6860 0.7080 0.7410 0.5170 0.9250 1.0530 1.6110
22  1.0880 0.7180 0.8170 0.9870 0.6760 1.2960 0.7440 0.5040 0.7100 0.5290 0.6840 0.5970 0.4910 0.5040 0.4740
23  0.8035 0.6580 0.8226 0.7705 0.5800 0.7730 0.7578 0.8140 0.7858 0.6750 0.7770 0.8587 0.8430 0.9720 1.1470
24  2.1321 2.4360 2.7240 1.6260 2.2290 2.7950 2.0864 2.7400 2.2740 2.1490 1.3600 3.0110 1.4560 1.0580 1.8450
25  0.8875 0.7710 0.8860 0.7840 0.9430 0.7260 0.9860 0.8980 0.8698 0.9440 0.7230 0.9428 1.1010 0.8320 1.0630
26  1.0330 1.0140 1.0050 1.0330 0.9580 1.1380 0.8830 0.7020 0.8170 0.8365 0.7400 0.6160 0.4830 0.5420 0.5750
27  0.8324 0.8225 0.8515 0.7993 0.7993 0.8369 0.8470 0.8428 0.8146 0.5880 0.7989 0.8876 0.8581 1.3610 0.8703
28  0.6400 0.8610 0.7840 0.9300 0.7740 0.7460 0.8090 0.8980 0.9080 0.7800 0.8180 1.3400 0.9380 0.8500 0.9690
29  1.1340 0.8940 0.9030 0.9320 0.9130 0.9630 0.9370 1.0760 1.0020 0.7160 0.9970 0.8790 0.8980 0.9820 1.4650
30  [illegible] 0.6410 0.4990 0.7190 0.6390 0.5680 0.6970 0.7320 0.6130 0.5620 0.8380 0.7782 0.7340 0.9250 1.2270
31  1.6570 1.0600 1.4730 1.1390 1.3130 1.2250 1.0770 0.7370 0.8930 0.9840 0.8300 1.1270 0.6860 0.9930 0.5080
32  0.5460 0.5370 0.4830 0.8570 0.5820 0.5560 0.7520 0.6900 0.8480 0.7360 0.6210 0.6410 0.7410 0.6990 1.2750
33  0.8792 0.6450 0.5600 0.9270 0.7950 1.1120 0.8335 0.8897 0.8615 0.8911 0.8070 0.9345 0.9170 1.1900 0.9570
34  0.8244 6.6000 1.1050 0.9200 0.9440 0.8289 0.9430 0.8020 0.8067 0.7580 0.6690 0.8797 0.9030 0.5890 0.8320
35  0.9160 0.7790 0.7770 1.2340 0.7430 1.1970 0.7860 0.6580 0.8250 0.3920 0.5450 0.8440 0.5240 0.6310 0.7320
36  0.8912 0.5390 0.7970 1.1880 0.6820 0.7010 0.8760 0.9970 0.8000 0.9060 0.8980 1.1740 1.0260 0.7550 1.1330
37  1.1840 1.2700 1.4890 0.8670 1.2400 1.2230 1.0870 1.2670 1.3870 1.9100 1.2300 1.2190 1.2703 0.8150 1.2300
38  1.1304 1.1205 1.1495 1.0973 1.0973 1.5090 0.6400 1.1408 1.1126 1.1422 1.0969 1.18[illegible] 1.1403 1.2410
39  1.4857 2.0350 1.5048 1.1970 1.9620 1.5820 1.7220 1.9630 1.4430 1.8120 1.9020 1.2856 0.8840 0.9620 0.5600
40  1.9650 1.6730 1.7710 1.4780 1.3830 1.7990 1.0340 0.7250 0.7970 0.7560 1.0390 0.4410 0.4940 0.7770 0.4780
41  0.81[illegible] 0.9690 1.0640 1.0330 0.7280 0.7810 0.8790 0.9281 0.7830 0.9610 1.1140 1.2200 0.7270 0.9320 0.8390
42  1.5686 2.6710 2.6750 1.3670 1.2040 1.7650 1.4580 1.5791 1.0230 1.5810 2.0400 1.8630 1.0030 1.1640 0.5730
43  0.9814 0.9715 1.0005 0.9483 0.9483 0.9859 0.9810 0.9918 0.9636 0.9932 0.9479 1.0366 1.0071 0.9913 1.0193
44  1.1596 1.7120 1.1760 1.1980 1.3410 1.0080 1.1139 1.2340 1.1419 1.1715 1.1262 0.8310 1.1854 1.1695 0.7740
45  0.9870 1.1340 1.2600 0.8850 1.0880 0.8450 1.0060 0.9790 1.0850 1.1040 1.2680 2.4300 0.9370 0.8080 0.9910
46  1.1520 1.0002 0.9720 0.7860 1.1950 0.9610 1.0550 0.9800 0.9923 0.8080 1.0300 1.0653 1.0410 1.2110 0.9250
47  0.9300 0.9154 0.8450 0.5610 0.8790 0.7310 0.8796 0.7120 0.9076 1.0470 0.9990 0.9805 0.9450 1.2500 1.2750
48  0.5700 0.7360 0.5800 0.7800 0.5720 0.8418 0.6720 0.6990 0.8196 0.8960 0.8450 0.8570 1.2200 1.2220 1.2310
49  1.4900 1.3340 0.9800 1.1665 1.0320 1.2690 1.1310 1.2100 1.1818 1.2114 1.0980 1.3380 1.3520 1.1020 1.0650
50  1.4590 1.2200 1.4490 1.7810 1.7520 1.3200 1.2210 1.0720 1.1140 1.4820 1.1160 0.5410 0.4620 0.4540 0.4910


Table E4: Disease Stage and Gene Expression for Selected Genes

Example 11: Apparatus for use of the method

Referring to Figure 5, a personal computer 20 suitable for
implementing methods according to embodiments of the
present invention is shown. Computer 20 operates under
the instruction of a software program stored on hard disk
data storage device 21. Computer 20 further includes a
processor 22, memory 23, display screen 24, printer 25 and
input devices mouse 26 and keyboard 27. The computer may
have communication means such as a network connection 27
to the internet 28 or data collecting means 28 to
facilitate downloading or collection and sharing of data.
The data collection means collects or downloads data from
a system. The computer includes a manipulation means
embodied in software which communicates with mouse 26 and
keyboard 27 to allow a user to implement the method
according to the embodiments of the invention on the data.
The system includes a means embodied in the software to
implement the method according to the embodiments of the
present invention, and means to create a graphic. After
the method has been implemented, the output may be
illustrated graphically on display screen 24 and/or
printed on printer 25.

In the above examples, implementation of the invention has
been described in relation to a biological system. As
discussed previously, the invention may be applied to any
"system" requiring features of samples to be predicted.
Examples of systems include chemical systems, agricultural
systems, weather systems, financial systems including, for
example, credit risk assessment systems, insurance
systems, marketing systems or company record systems,
electronic systems, physical systems, astrophysics systems
and mechanical systems.

Modifications and variations as would be apparent to a
skilled addressee are deemed to be within the scope of the
present invention.
CLAIMS

1. A method for identifying a subset of components of a
system, the subset being capable of predicting a feature
of a test sample, the method comprising the steps of:

(a) generating a linear combination of components and
component weights in which values for each component
are introduced from data generated from a plurality
of training samples, each training sample having a
known feature;

(b) defining a model for the probability distribution of
a feature wherein the model is conditional on the
linear combination and wherein the model is not a
combination of a binomial distribution for a two
class response with a probit function linking the
linear combination and the expectation of the
response;

(c) constructing a prior distribution for the component
weights of the linear combination comprising a
hyperprior having a high probability density close to
zero;

(d) combining the prior distribution and the model to
generate a posterior distribution;

(e) identifying a subset of components having component
weights that maximise the posterior distribution.

2. The method of claim 1 wherein the model is a likelihood
function based on a model selected from the group
comprising a multinomial or binomial logistic regression,
generalised linear model, Cox's proportional hazards model
and parametric survival model.



3. The method of claim 1 or 2 wherein the model is a
likelihood function based on a multinomial or binomial
logistic regression.

4. The method of claim 2 or 3 wherein the logistic
regression models a feature having a multinomial or
binomial distribution.

5. The method of any one of claims 1 to 4 wherein the
subset of components is capable of classifying a
sample into one of a plurality of pre-defined groups
by defining a logistic regression which comprises
grouping the samples into a plurality of sample
groups, each sample group having a common group
identifier.



6. The method of any one of claims 1 to 5 wherein the
logistic regression is of the form:

$$
L = \prod_{i=1}^{n} \left[ \prod_{g=1}^{G-1} \left( \frac{e^{x_i^T \beta_g}}{1 + \sum_{h=1}^{G-1} e^{x_i^T \beta_h}} \right)^{e_{ig}} \right] \left( \frac{1}{1 + \sum_{h=1}^{G-1} e^{x_i^T \beta_h}} \right)^{e_{iG}}
$$

wherein

$x_i^T \beta_g$ is a linear combination generated from input data
from training sample i with component weights $\beta_g$;

$x_i^T$ is the components for the i-th row of X and $\beta_g$ is a
set of component weights for sample class g;

$e_{ig} = 1$ if training sample i is a member of class g, $e_{ig} = 0$
otherwise;

and

X is data from n training samples comprising p
components.
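
The multinomial logistic likelihood of claim 6 can be evaluated with a few lines of code. The sketch below is illustrative only: the function names are hypothetical, class G is treated as the reference class with no weight vector of its own, and the data in the usage lines are made up.

```python
import math


def class_probabilities(x, betas):
    """Class probabilities for one sample.

    x     : list of p component values for the sample
    betas : list of G-1 weight vectors (class G is the reference class)
    """
    etas = [sum(xj * bj for xj, bj in zip(x, beta_g)) for beta_g in betas]
    denom = 1.0 + sum(math.exp(eta) for eta in etas)
    probs = [math.exp(eta) / denom for eta in etas]
    probs.append(1.0 / denom)  # reference class G
    return probs


def log_likelihood(X, classes, betas):
    """Log of the likelihood: sum over samples of log P(observed class)."""
    return sum(
        math.log(class_probabilities(x, betas)[g - 1])
        for x, g in zip(X, classes)
    )


# Made-up data: 2 samples, p = 2 components, G = 3 classes, all weights zero.
X = [[1.0, 0.5], [0.2, 1.3]]
classes = [1, 3]
betas = [[0.0, 0.0], [0.0, 0.0]]
value = log_likelihood(X, classes, betas)
```

With all weights at zero every class has probability 1/3, so the log-likelihood of the two made-up samples is 2 log(1/3).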

7. The method of claim 1 or 2 wherein the subset of
components is capable of classifying a sample into a
class wherein the class is one of a plurality of
predefined ordered classes, by defining a logistic
regression which comprises defining a series of group
identifiers in which each group identifier corresponds
to a member of an ordered class, and grouping the
samples into one of the ordered classes.

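8. The method of claim 7 wherein the logistic regression
is of the form:

$$
l = \prod_{i=1}^{N} \prod_{k=1}^{G-1} \left( \frac{\gamma_{ik}}{\gamma_{i,k+1}} \right)^{r_{ik}} \left( \frac{\gamma_{i,k+1} - \gamma_{ik}}{\gamma_{i,k+1}} \right)^{r_{i,k+1} - r_{ik}}
$$

$$
\operatorname{logit}\left( \frac{\gamma_{i,k+1} - \gamma_{ik}}{\gamma_{i,k+1}} \right) = \log\left( \frac{\gamma_{i,k+1} - \gamma_{ik}}{\gamma_{ik}} \right) = \theta_k + x_i^T \beta^*
$$

wherein

$\gamma_{ik}$ is the probability that training sample i belongs to
a class with identifier less than or equal to k (where
the total of ordered classes is G);

$x_i^T \beta^*$ is a linear combination generated from input data
from training sample i with component weights $\beta^*$;

X is data from n training samples comprising p
components;

$x_i^T$ is the components for the i-th row of X;

$r_{ij}$ is defined as

$$
r_{ij} = \sum_{g=1}^{j} c_{ig}
$$

where

$$
c_{ij} = \begin{cases} 1, & \text{if observation } i \text{ in class } j \\ 0, & \text{otherwise} \end{cases}
$$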



9. The method of claim 1 or 2 wherein the model is a
likelihood function based on a generalised linear
model.

10. The method of claim 9 wherein the generalised linear
model models a feature that is distributed as a
regular exponential family of distributions.

11. The method of claim 10 wherein the regular exponential
family of distributions is selected from the group
consisting of normal distribution, Gaussian
distribution, Poisson distribution, exponential
distribution, gamma distribution, Chi Square
distribution and inverse gamma distribution.

12. The method of claim 1 or 2 wherein the subset of
components is capable of predicting a predefined
characteristic of a sample by defining a generalised
linear model which comprises modelling the
characteristic to be predicted.

13. The method of claim 3 or 10 wherein the generalised
linear model is of the form:

$$
L = \log p(y \mid \beta, \varphi) = \sum_{i=1}^{n} \left\{ \frac{y_i \theta_i - b(\theta_i)}{a_i(\varphi)} + c(y_i, \varphi) \right\}
$$

wherein

$y = (y_1, \ldots, y_n)^T$, and $y_i$ is the characteristic
measured on the i-th sample;

$a_i(\varphi) = \varphi / w_i$ with the $w_i$ being a fixed set of known
weights and $\varphi$ a single scale parameter;

the functions b(.) and c(.) are as defined by Nelder
and Wedderburn (1972);

$E\{y_i\} = b'(\theta_i)$

$\mathrm{Var}\{y_i\} = b''(\theta_i)\, a_i(\varphi) = \tau_i^2\, a_i(\varphi)$;

and wherein each observation has a set of covariates
$x_i$ and a linear predictor $\eta_i = x_i^T \beta$.
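
As one concrete instance of the exponential-family form in claim 13, the sketch below evaluates the log-likelihood for Poisson counts. It assumes the standard Poisson choices b(theta) = exp(theta), a_i(phi) = 1 and c(y, phi) = -log(y!), together with the canonical link theta_i = eta_i = x_i^T beta; the function name and the data are hypothetical.

```python
import math


def poisson_glm_log_likelihood(X, y, beta):
    """Sum over samples of y_i*theta_i - b(theta_i) + c(y_i) for Poisson counts."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        theta_i = sum(x * b for x, b in zip(x_i, beta))  # canonical link
        total += y_i * theta_i - math.exp(theta_i) - math.lgamma(y_i + 1)
    return total


# Made-up example: one sample, two covariates, all weights zero.
value = poisson_glm_log_likelihood([[1.0, 0.5]], [2], [0.0, 0.0])
```

With zero weights the fitted mean is 1, so the value equals the log of the Poisson probability of observing 2 with mean 1, that is -1 - log 2.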

14. The method of claim 1 or 2 wherein the model is a
likelihood function based on a model selected from the
group consisting of Cox's proportional hazards model,
parametric survival model and accelerated survival
times model.

15. The method of claim 1 wherein the subset of components
is capable of predicting the time to an event for a
sample by defining a likelihood based on Cox's
proportional hazards model, a parametric survival
model or an accelerated survival times model, which
comprises measuring the time elapsed for a plurality
of samples from the time the sample is obtained to the
time of the event.

16. The method of claim 14 wherein Cox's proportional
hazards model is of the form:

$$
l(t \mid \beta) = \prod_{j=1}^{N} \left( \frac{\exp(Z_j \beta)}{\sum_{i \in \Re_j} \exp(Z_i \beta)} \right)^{d_j}
$$

wherein

X is data from n training samples comprising p
components;

Z is a matrix that is the re-arrangement of the
rows of X where the ordering of the rows of Z
corresponds to the ordering induced by the
ordering of the survival times;

d is the result of ordering the censoring index
with the same permutation required to order
survival times;

$Z_j$ is the j-th row of the matrix Z and $d_j$ is the
j-th element of d;

$\Re_j = \{ i : i = j, j+1, \ldots, N \}$ = the risk set at the j-th
ordered event time $t_{(j)}$.
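
The partial likelihood of claim 16 is easier to follow on the log scale. The sketch below is a minimal illustration under stated assumptions: the rows of Z are already sorted by increasing survival time, there are no tied event times, d[j] is 1 for an observed event and 0 for a censored sample, and the function name and data are hypothetical.

```python
import math


def cox_partial_log_likelihood(Z, d, beta):
    """Log partial likelihood for rows of Z ordered by survival time."""
    scores = [math.exp(sum(z * b for z, b in zip(row, beta))) for row in Z]
    total = 0.0
    for j in range(len(Z)):
        if d[j]:
            # Risk set at the j-th ordered time: samples j, j+1, ..., N.
            risk = sum(scores[j:])
            total += math.log(scores[j] / risk)
    return total


# Made-up data: 3 samples, 1 component, the second sample censored.
Z = [[0.4], [1.1], [0.7]]
d = [1, 0, 1]
value = cox_partial_log_likelihood(Z, d, [0.0])
```

With a zero weight every sample has the same score, so the first event contributes log(1/3), the censored sample contributes nothing, and the last event contributes log(1/1) = 0.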

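17. The method of claim 14 wherein the parametric hazards
model is of the form:

$$
L = \sum_{i=1}^{N} \left\{ c_i \log(\mu_i) - \mu_i + c_i \log\left( \frac{\lambda(y_i)}{\Lambda(y_i; \varphi)} \right) \right\}
$$

where

$\mu_i = \Lambda(y_i; \varphi) \exp(x_i \beta)$;

$c_i = 1$ if the i-th sample is uncensored and $c_i = 0$ if the
i-th sample is censored;

the functions $\lambda(.)$ and $\Lambda(.)$ are as defined by Aitkin
and Clayton (1980);

$x_i$ is the i-th row of X and X is data from n training
samples comprising p components.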
samples comprising p components; 
18. The method of any one of claims 1 to 17 wherein the
prior distribution is of the form:

$$
p(\beta) = \int_{v^2} p(\beta \mid v^2)\, p(v^2)\, dv^2
$$

where $p(\beta \mid v^2)$ is $N(0, \mathrm{diag}\{v^2\})$;

$v$ is a hyperparameter;

$p(v^2)$ is a hyperprior distribution.

19. The method of any one of claims 1 to 18 wherein the
hyperprior is a Jeffreys prior of the form:

$$
p(v^2) \propto \prod_{i} 1 / v_i^2
$$

20. The method of any one of claims 1 to 19 wherein the
posterior distribution is of the form:

$$
p(\beta, \varphi, v \mid y) \propto L(y \mid \beta, \varphi)\, p(\beta \mid v^2)\, p(v^2)
$$

wherein $L(y \mid \beta, \varphi)$ is the likelihood function.

21. The method of any one of claims 1 to 20 wherein the
posterior distribution is maximised using an iterative
procedure.

22. The method of claim 21 wherein the iterative procedure
is an EM algorithm.
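
To illustrate how an EM iteration with a normal prior and a Jeffreys hyperprior drives most component weights to zero, the sketch below works through the simplest case. It assumes a Gaussian likelihood with known noise variance `sigma2` and orthonormal component columns, so that each weight can be updated on its own from its least-squares estimate `z_i`; these simplifications, the function name and the numbers are assumptions made for illustration and are not the procedure set out in the specification. Under a hyperprior proportional to 1/v_i^2, the E-step expectation of 1/v_i^2 given the current weight is 1/beta_i^2, and the M-step then has the closed form used in the loop.

```python
import math


def em_normal_jeffreys(z, sigma2=1.0, max_iter=500, tol=1e-12):
    """EM fixed-point iteration for the simplified (orthonormal, Gaussian) case."""
    beta = list(z)  # start from the least-squares estimates
    for _ in range(max_iter):
        updated = []
        for b, z_i in zip(beta, z):
            # E-step: E[1/v_i^2 | beta_i] = 1/beta_i^2.
            # M-step: beta_i = z_i / (1 + sigma2/beta_i^2), rearranged so that
            # a weight that has reached zero stays at zero.
            updated.append(b * b * z_i / (b * b + sigma2))
        change = max(abs(u - b) for u, b in zip(updated, beta))
        beta = updated
        if change < tol:
            break
    return beta


# Made-up least-squares estimates for three components.
z = [5.0, 0.5, -3.0]
weights = em_normal_jeffreys(z)
selected = [i for i, w in enumerate(weights) if abs(w) > 1e-6]
```

In this simplified setting a non-zero fixed point satisfies beta^2 - z*beta + sigma2 = 0, which has a real solution only when |z| is at least 2*sqrt(sigma2). The component with z = 0.5 therefore shrinks to zero and drops out of the selected subset, while the other two settle at non-zero weights.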

23. The method of any one of claims 1 to 22 wherein the
system is a biological system.

24. The method of claim 23 wherein the biological system
is a biotechnology array.

25. The method of claim 24 wherein the biotechnology array
is selected from the group consisting of DNA array,
protein array, antibody array, RNA array, carbohydrate
array, chemical array, lipid array.

26. A method for identifying a subset of components of a
subject which are capable of classifying the subject
into one of a plurality of predefined groups wherein
each group is defined by a response to a test
treatment comprising the steps of:

(d) exposing a plurality of subjects to the test
treatment and grouping the subjects into response
groups based on responses to the treatment;

(e) measuring components of the subjects;

(f) identifying a subset of components that is capable of
classifying the subjects into response groups using
the methods according to any one of claims 1 to 28.

27. The method of claim 26 wherein the components are
selected from the group consisting of genes, small
nucleotide polymorphisms (SNPs), proteins, antibodies,
carbohydrates, lipids.

28. An apparatus for identifying a subset of components of
a system from data generated from the system from a
plurality of samples from the system, the subset being
capable of predicting a feature of a test sample, the
apparatus comprising:

(a) means for generating a linear combination of
components and component weights in which values for
each component are introduced from data generated
from a plurality of training samples, each training
sample having a known feature;

(b) means for defining a model for the probability
distribution of a feature wherein the model is
conditional on the linear combination and wherein the
model is not a combination of a binomial distribution
for a two class response with a probit function
linking the linear combination and the expectation of
the response;

(c) means for constructing a prior distribution for the
component weights of the linear combination
comprising a hyperprior having a high probability
density close to zero;

(d) means for combining the prior distribution and the
model to generate a posterior distribution;

(e) means for identifying a subset of components having
component weights that maximise the posterior
distribution.

29. A computer program arranged, when loaded onto a
computing apparatus, to control the computing apparatus to
implement a method in accordance with any one of claims 1
to 27.

30. The computer program of claim 29 implemented with the
method of any one of claims 1 to 27.

31. A computer readable medium providing a computer
program in accordance with claim 29 or 30.

32. A method of testing a sample from a system to identify
a feature of the sample, the method comprising the steps
of testing for a subset of components which is diagnostic
of the feature, the subset of components having been
determined by a method in accordance with any one of
claims 1 to 27.

33. An apparatus for testing a sample from a system to
determine a feature of the sample, the apparatus including
means for testing for components identified in accordance
with the method of any one of claims 1 to 27.
34. A computer program which, when run on a computing
device, is arranged to control the computing device, in a
method of identifying components from a system which are
capable of predicting a feature of a test sample from the
system, and wherein a linear combination of components and
component weights is generated from data generated from a
plurality of training samples, each training sample having
a known feature, and a posterior distribution is generated
by combining a prior distribution for the component
weights comprising a hyperprior having a high probability
distribution close to zero, and a model that is
conditional on the linear combination, wherein the model is
not a combination of a binomial distribution for a two
class response with a probit function linking the linear
combination and the expectation of the response, to
estimate component weights which maximise the posterior
distribution.

35. A method for identifying a subset of components of a
biological system, the subset being capable of predicting
a feature of a test sample from the biological system, the
method comprising the steps of:

(a) generating a linear combination of components and
component weights in which values for each component
are determined from data generated from a plurality
of training samples, each training sample having a
known feature;

(b) defining a model for the probability distribution of
a feature wherein the model is conditional on the
linear combination;

(c) constructing a prior distribution for the component
weights of the linear combination comprising a
hyperprior having a high probability density close to
zero;

(d) combining the prior distribution and the model to
generate a posterior distribution;

(e) identifying a subset of components having component
weights that maximise the posterior distribution.